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Abstract — A unified view of the area of sparse signal processing 
is presented in tutorial form by bringing together various fields 
in which the property of sparsity has been successfully exploited. 
For each of these fields, various algorithms and techniques, 
which have been developed to leverage sparsity, are described 
succinctly. The common potential benefits of significant reduction 
in sampling rate and processing manipulations through sparse 
signal processing are revealed. 

The key application domains of sparse signal processing are 
sampling, coding, spectral estimation, array processing, compo- 
nent analysis, and multipath channel estimation. In terms of 
the sampling process and reconstruction algorithms, linkages 
are made with random sampling, compressed sensing and rate 
of innovation. The redundancy introduced by channel coding 
in finite and real Galois fields is then related to over-sampling 
with similar reconstruction algorithms. The methods of Prony, 
Pisarenko, and Multiple Signal Classification (MUSIC) are next 
shown to be targeted at analyzing signals with sparse frequency 
domain representations. Specifically, the relations of the approach 
of Prony to an annihilating filter in rate of innovation and Error 
Locator Polynomials in coding are emphasized; the Pisarenko 
and MUSIC methods are further improvements of the Prony 
method. Such narrowband spectral estimation is then related to 
multi-source location and direction of arrival estimation in array 
processing. The notions of sparse array beamforming and sparse 
sensor networks are also introduced. Sparsity in unobservable 
source signals is also shown to facilitate source separation in 
Sparse Component Analysis (SCA); the algorithms developed in 
this area are also widely used in compressed sensing. Finally, the 
nature of the multipath channel estimation problem is shown to 
have a sparse formulation; algorithms similar to sampling and 
coding are used to estimate typical multicarrier communication 
channels. 
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Fig. 1. Sparse discrete time signal with its Discrete Fourier Transform (DFT). 
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Fig. 2. Sparsity is manifested in the frequency domain. 



I. Introduction 

THERE are many applications in signal processing and 
communication systems where the discrete signals are 
sparse in some domain such as time, frequency, or space i.e., 
most of the samples are zero, or alternatively their transform 
in another domain (normally called "frequency coefficients'') 
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is sparse (see Figs. [T] and [2|. There are trivial sparse trans- 
formations where the sparsity is preserved in both "time" and 
"frequency" domains; the identity transform matrix and its 
sorted versions are extreme examples. Wavelet transformations 
that preserve the local characteristics of a sparse signal can 
be regarded as "almost" sparse in the "frequency" domain; in 
general, for sparse signals, the more similar the transformation 
matrix is to an identity matrix, the sparser the signal is in the 
transform domain. In addition, the transform matrix may be 
sparse; wavelet transformation matrices are such examples. 

In any of these scenarios, sampling and processing can be 
optimized using sparse signal processing. In other words, the 
sampling rate and the processing manipulations can be signifi- 
cantly reduced; hence, a combination of data compression and 
processing time reduction can be achieve(£l. 

Each field has developed its own tools, algorithms, and 
reconstruction methods for sparse signal processing. Very few 

^Sparse Signal Processing, Panel Session organized and chaired by F. 
Marvasti and lectured by Profs. E. Candes, R. Baraniuk, P. Marziliano, and 
Dr. A. Cichoki, ICASSP 2008, Las Vegas, May 2008. 
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TABLE I 

Common notations used throughout the paper. 



n Length of original vector 

k Order of sparsity 

m Length of observed vector 

X Original vector 

s Corresponding sparse vector 

y Observed vector 

u Noise vector 

A Transformation matrix relating s to y 



authors have noticed the similarities of these field^. It is 
the intention of this tutorial to describe these methods in 
each field succinctly and show that these methods can be 
used in other areas and applications often with appreciable 
improvements. Among these fields are l-Sampling: random 
sampling of bandlimited signals [2], Compressed Sensing (CS) 
[3], and sampling with finite rate of innovation [4]; 2- Coding: 
Galois [5], [6] and real-field error correction codes [7]; 3- 
Spectral Estimation [8], [9], [10], [11]; 4- Array Processing: 
Multi-Source Location (MSL) and Direction Of Arrival (DOA) 
estimation [12], [13], sparse array processing [14], and sensor 
networks [15]; 5- Sparse Component Analysis (SCA): blind 
source separation [16], [17], [18] and dictionary representation 
[19], [20], [21]; 6- Channel Estimation in Orthogonal Fre- 
quency Division Multiplexing (OFDM) [22], [23], [24]. The 
sparsity properties of these fields are summarized in Table 
InFl . The details of each application will be discussed in the 
next sections but the common traits will be discussed in this 
introduction. 

The columns of Table HIl consist of 0- category, 1- topics, 2- 
sparsity domain, 3- type of sparsity, 4- information domain, 
5- type of sampling in information domain, 6- minimum 
sampling rate, 7- reconstruction method, and 8- applications. 
The first rows (2-7) of column 1 are on sampling techniques. 
The 8-9^^ rows are related to channel coding, row 10 is 
on spectral estimation and rows 11-13 are related to array 
processing. Rows 14-15 correspond to SCA and finally, row 
16 covers multicarrier channel estimation, which is a rather 
new topic. As shown in column 2 of the table, depending on 
the topics, sparsity is defined in time, space, or "frequency" 
domains. In some applications, the sparsity is defined as the 
number of polynomial coefficients (which in a way could be 
regarded as "frequency"), the number of sources (which may 
depend on location or time sparsity for the signal sources), 
or the number of "words" (signal bases) in a dictionary. 
The type of sparsity is shown in column 3; for sampling 
schemes, it is usually low-pass, band-pass, or multiband [25], 
while for compressed sensing, and most other applications, 
it is random. Column 4 represents the information domain, 
where the number of sparsity, locations, and amplitudes can 
be determined by proper sampling (column 5) of this domain. 

problems that arise in interpolation, spectrum analysis, error-control 
coding, and fault-tolerant computing. We believe that the relations between 
these problems have gone nearly unnoticed so far" [1]. 

list of acronyms is given in Table IXVl at the end of the paper. 



The other columns are self explanatory and will be discussed 
in more details in the following sections. 

The rows 2-4 of Table [III are related to the sampling 
(uniform or random) of signals that are bandlimited in the 
Fourier domain. Band-limitedness is a special case of sparsity 
where the nonzero coefficients in the frequency domain are 
consecutive. A better assumption in the frequency domain is 
to have random sparsity [26], [27], [28] as shown in row 
5 and column 3. A generalization of the sparsity in the 
frequency domain is sparsity in any transform domain such as 
Discrete Cosine Transform (DCT) and wavelets; this concept 
is further generalized in compressed sensing (row 6) where 
sampling is talcen by a linear combination of time domain 
samples [3], [29], [30], [31]. Sampling of signals with finite 
rate of innovation (row 7) is related to piecewise smooth 
(polynomial based) signals. The position of discontinuous 
points is determined by annihilating filters that are equivalent 
to error locator polynomials in error correction codes and 
Prony's method [28] as discussed in Sections [nil and HVl 
respectively. 

Random errors in a Galois field (row 8) and the additive 
impulsive noise in real-field error correction codes (row 9) 
are sparse disturbances that need to be detected and removed. 
For erasure channels, the impulsive noise can be regarded 
as the negative of the sample value [32]; thus the missing 
sampling problem, which can also be regarded as a special 
case of nonuniform sampling, is also a special case of the error 
correction problem. A subclass of impulsive noise for 2-D 
signals is salt and pepper noise [33]. The information domain, 
where the sampling process occurs, is called the syndrome 
which is usually in a transform domain. In addition, for special 
binary codes such as Low Density Parity Check (LDPC) codes, 
the parity check matrix is extremely sparse. Sparse matrix 
inversions and manipulations [34], [35], [36] are utilized in 
Discrete Wavelet Transform (DWT) and LDPC decoding. 

Spectral estimation (row 10) is the dual of error correction 
codes, i.e., the sparsity is in the frequency domain. MSL 
(row 11) and multi-target detection in radars are similar to 
spectral estimation since targets act as spatial sparse mono- 
tones; each target is mapped to a specific spatial frequency 
regarding its line of sight direction relative to the receiver. 
The techniques developed for this branch of science is quite 
unique; with examples such as MUSIC [8], Prony [9], and 
Pisarenko [10]. We shall see that the techniques used in real- 
field error correction codes and SCA can also be used in this 
area. 

The array processing category (rows 11-13) consists of 
three separate topics. The first one covers MSL in radars and 
sonars and DOA, which are similar to spectral estimation. 
The techniques developed for this field are similar to the 
spectral estimation methods with emphasis on the Minimum 
Description Length (MDL) [37]. The second topic in the array 
processing category is related to the design of sparse arrays 
where some of the array elements are missing; the remaining 
nodes form a nonuniform sparse grid. In this case, one of the 
optimization problems is to find the sparsest array (number, 
location and weight of elements) for a given beampattern. 
This problem has some resemblance to the missing sampling 



3 



TABLE II 

Various topics and applications with sparsity properties: the sparsity, which may be in time/space or "frequency" domains, 
consists of unknown samples/coefficients that need to be determined. the information domain consists of known 

SAMPLES/COEFFICIENTS IN "FREQUENCY" OR TIME/SPACE DOMAIN (THE COMPLEMENT OF THE SPARSE DOMAIN). A LIST OF ACRONYMS IS GIVEN IN 
TABLe IXVI aT the end of the paper; also, a list of common notations is presented in TABLeU For definition of esprit ON ROW 1 1 AND 

COLUMN 7, SEE THE FOOTNOTE ON PAGeFTS] 
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Category 


Topics 


Sparsity 
Domain 


Type 
of 
Sparsity 


Information 
Domain 


Type of 
Sampling in 
Info. Domain 


Min Number 
of Required 
Samples 


Reconstruction 
Method 


Applications 


2 




Uniform 
sampling 


Frequency 


Lowpass 


Time/Space 


Uniform 


2 X BW - 1 


Lowpass 
filtering / 
Interpolation 


A/D 


3 




Nonuniform 
sampling 


Frequency 


Lowpass 


Time/Space 


Missing samp- 
-les/Jitter/Per- 
-iodic/Random 


2x BW -1 
(in some cases 
even BW) 


Iterative Metho- 
-ds/Filter banks/ 
Spline Interp. 


Seismic / 
MRI / CT/ 
FM / PPM 


4 


Sampling 


Sampling of 
multiband 
signals 


Frequency 


Union of 
disjoint 
intervals 


Time/Space 


Uniform/Jit- 
-ter/Periodic/ 
Random 


2 X ^BW 


Iterative metho- 
-ds/Filter banks/ 
Interpolation 


Data 
Compression/ 
Radar 


5 




Random 
sampling 


Frequency 


Random 


Time/Space 


Random/ 
Uniform 


2 X E #Coeff. 


Iterative methods: 
Adapt. Thresh. 
RDF / FLP 


Missing Samp. 
Recovery/ 
Data Comp. 


6 




Compressed 
sensing 


An arbitrary 
orthonormal 
transform 


Random 


Random 
mapping of 
Time/ S p ac e 


Random 
mixtures 
of samples 


c - k- log(f ) 


Basis pursuit/ 
Matching 
pursuit 


Data 
compression 


7 




Finite 
rate of 
innovation 


Time and 
polynomial 
Coeff 


Random 

and 
uniform 


Filtered 

time 
domain 


Uniform 


# Coeff. + 1 + 
2 • (# Discont. 
Epochs) 


Annihilating 
filter 


ECG/ 
OCT/ 
UWB 


8 


Channel 


Galois 
field 
codes 


Time 


Random 


Syndrome 


Uniform 

or 
random 


2 X # errors 


Berlekamp 
-Massey/Viterbi/ 
Belief Prop. 


Digital 
communic- 
-tion 


9 


coding 


Real 
field 
codes 


Time 


Random 


Transform 
domain 


Uniform 

or 
random 


2 X # Impulsive 
noise 


Adaptive 
thresholding 
pr»TH / ThT p 

ivUrl/ / CL^r 


Fault 
tolerant 
system 


10 


Spectral 
estimation 


Spectral 
estimation 


Frequency 


Random 


Time / 
Autocor- 
-relation 


Uniform 


2 X # Tones 
-1 


MUSIC/ 
Pisarenko/ 
Prony / MDL 


Military/ 
Radars 


11 




MSL/ 
DOA 
estimation 


Space 


Random 


Space/ 
Autocor- 
-relation 


Uniform 


2x 
# Sources 


MDL+ 
MUSIC / 
ESPRIT 


Radars/ 
Sonar/ 
Ultrasound 


12 


Array 
processing 


Sparse arr- 
-ay beam- 
-forming 


Space 


Random/ 
Missing 
elements 


Space 


Peaks of 
sidelobes/ 
[Non] Uniform 


2 X # Desired 
array 
elements 


Optimiz- 
-ation: LP/ 
SA / GA 


Radars/sonar/ 
Ultrasound/ 
MSL 


13 




Sensor 
networks 


Space 


Random 


Space 


Uniform 


2x BW 
of random 
field 


Similar 
to row 5 


Seismic/ 
Meteorology/ 
Environmental 


14 


SCA 


BSS 


Active 
source/Time 


Random 


Time 


Uniform 


2 X # Active 
sources 


£i / £2/ 
SLO 


Biomedical 


15 




SDR 


Dictionary 


Random 


Linear mix- 
-ture of time 
samples 


Uniform / 
Random 


2x 
# Sparse 
Words 


/ ^2 / 
SLO 


Data 
compression 


16 


Channel 
estimation 


Multipath 
channels 


Time 


Random 


Frequency 
or time 


Uniform / 
Nonuniform 


2 X # Spa- 
-rse channel 
components 


ii 1 
MIMAT 


Channel 
equalization/ 
OFDM 



problem (which is a special case of real-field error correction 
codes), and may be solved by similar techniques. The third 
topic is on sensor networks (row 13). Distributed sampling 
and recovery of a physical field using an array of sparse 
sensors is a problem of increasing interest in environmental 
and seismic monitoring applications of sensor networks [38]. 
Sensor fields may be bandlimited or non-bandlimited. Since 
the power consumption is the most restricting issue in sensors, 
it is vital to use the lowest possible number of sensors (sparse 
sensor networks) with the minimum processing computation. 

In Sparse Component Analysis (SCA), the number of ob- 
servations is much less than the number of sources (signals). 
However, if the sources are sparse in the time domain, then 
the active sources and their amplitudes can be determined; 
this is equivalent to error correction codes. Sparse Dictionary 



Representation (SDR) is another new area where signals are 
represented by the sparsest number of words (signal bases) 
in a dictionary of finite number of words; this sparsity may 
result in tremendous amount of data compression. When the 
dictionary is overcomplete, there are many ways to represent 
the signal; however, we are interested in the sparsest repre- 
sentation. Normally, for extraction of statistically independent 
sources. Independent Component Analysis (IC^^ is used for a 
complete set of linear mixtures. In the case of a non-complete 
(underdetermined) set of linear mixtures, ICA can work if the 
sources are also sparse; for this special case, ICA analysis is 
synonymous with SCA. 

Finally, channel estimation is shown in row 16. In mobile 

^Also known as Karhunen Loeve Transform (KLT). 
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communication systems, multipath reflections create a channel 
that can be modeled by a sparse FIR filter. For proper decoding 
of the incoming data, the channel characteristics should be es- 
timated before it can be equalized. For this purpose, a training 
sequence is inserted within the main data, which enables the 
receiver to obtain the output of the channel by exploiting this 
training sequence. The channel estimation problem becomes a 
deconvolution problem under noisy environments. The sparsity 
criterion of the channel greatly improves the channel estima- 
tion; this is where the algorithms for extraction of a sparse 
signal could be employed [22], [23], [39]. 

When sparsity is random, further signal processing is 
needed. In this case there are three items that need to be 
considered. 1- Evaluating the number of sparse coefficients 
(or samples), 2- finding the position of sparse coefficients, 
and 3- determining the values of these coefficients. In some 
applications only the first two items are needed; e.g., in 
spectral estimation. However, in almost all the other cases 
mentioned in Table HIl all the three items should be determined. 
Various types of Linear Programming (LP) and some itera- 
tive algorithms, such as the Iterative Method with Adaptive 
Thresholding (IMAT), determine the number, positions and 
values of sparse samples at the same time. On the other hand, 
the Minimum Description Length (MDL) method, used in 
DOA/MSL and spectral estimation, determines the number 
of sparse source locations or frequencies. In the subsequent 
sections, we shall describe, in more details, each algorithm 
for various areas and applications based on Table HH 

Finally, it should be mentioned that the signal model for 
each topic or application may be deterministic or stochastic. 
For example, in the sampling category for rows 2-4 and 7, 
the signal model is typically deterministic although stochastic 
models could also be envisioned [40]. On the other hand for 
random sampling and CS (rows 5-6), the signal model is 
stochastic although deterministic models may also be envi- 
sioned [41]. In channel coding and estimation (rows 8-9 and 
16), the signal model is normally deterministic. For Spectral 
and DOA estimation (rows 10-11), stochastic models are 
assumed; while for array beam-forming (row 12), deterministic 
models are used. In sensor networks (row 13), both determin- 
istic and stochastic signal models are employed. Finally, in 
SCA (rows 14-15), statistical independence of sources may 
be necessary and thus stochastic models are applied. 

II. Sampling: Uniform, Nonuniform, Missing, 
Random, Compressed Sensing, Rate of Innovation 

Analog signals can be represented by finite rate discrete 
samples (uniform, nonuniform, or random) if the signal has 
some sort of redundancies such as band-limitedness, finite 
polynomial representation (e.g., periodic signals that are rep- 
resented by a finite number of trigonometric polynomials), 
and nonlinear functions of such redundant functions [42], 
[43]. The minimum sampling rate is the Nyquist rate for 
uniform sampling and its generalizations for nonuniform [2] 
and multiband signals [44]. When a signal is discrete, the 
equivalent discrete representation in the "frequency" domain 
(DFT, DCT, DWT, Discrete Hartley Transform (DHT), Dis- 



crete Sine Transform (DST)) may be sparse, which is the 
discrete version of bandlimited or multiband analog signals. 

For discrete signals, if the nonzero coefficients ("frequency" 
sparsity) are consecutive, depending on the location of the 
zeros, they are called lowpass, bandpass, or multiband dis- 
crete signals; otherwise, the "frequency" sparsity is random. 
The number of discrete time samples needed to represent a 
frequency- sparse signal follows the law of algebra, that is, 
the number of time samples should be equal to the number 
of coefficients in the "frequency" domain; this is equivalent 
to the Nyquist rate- twice the bandwidth (for discrete signals 
with DC components it is twice the bandwidth minus one). The 
dual of frequency- sparsity is time- sparsity, which can happen 
in a burst or a random fashion. The number of "frequency" 
coefficients needed follows the Nyquist criterion. This will be 
further discussed in Section |llll for sparse additive impulsive 
noise channels. 

A. Sampling of Sparse Signals 

If the sparsity locations of a signal are known in a transform 
domain, then the number of samples needed in the time (space) 
domain should be at least equal to the number of sparse co- 
efficients, i.e., the so called Nyquist rate. However, depending 
on the type of sparsity (lowpass, bandpass, or random) and the 
type of sampling (uniform, periodic nonuniform, or random), 
the reconstruction may be unstable and the corresponding 
reconstruction matrix may be ill-conditioned [45], [46]. Thus 
in many applications mentioned in Table HIl the sampling rate 
in column 6 is higher than the minimum (Nyquist) rate. 

When the location of sparsity is not known, by the law of 
algebra, the number of samples needed to specify the sparsity 
is at least twice the number of sparse coefficients. Again for 
stability reasons, the actual sampling rate is higher than this 
minimum figure [2], [44]. To guarantee stability, instead of 
direct sampling of the signal, a combination of the samples can 
be used. Donoho [30] has recently shown that if we take linear 
combinations of the samples, the minimum stable sampling 
rate is of the order 0{k log(^)), where n and k are the frame 
size and the sparsity number, respectively. 

1 ) Reconstruction Algorithms: There are many reconstruc- 
tion algorithms that can be used depending on the sparsity pat- 
tern, uniform or random sampling, complexity issues, and sen- 
sitivity to quantization and additive noise [47], [48]. Among 
these methods are: Linear Programming (LP), Lagrange inter- 
polation [49], time varying method [50], spline interpolation 
[51], matrix inversion [52], Error Locator Polynomial (ELP) 
[53], iterative techniques [1], [46], [54], [55], [56], [57], [58], 
and Iterative Methods with Adaptive Thresholding (IMAT) 
[26], [32], [59], [60]. In the following we will only concentrate 
on the last three methods that have proven to be effective and 
practical. 

Iterative Methods When the Location of Sparsity is Known: 
The reconstruction algorithms have to recover the original 
sparse signal from the information domain and the type of 
sparsity in the transform domain. We know the samples (both 
position and amplitude) and we know the location of sparsity 
in the transform domain. An iteration between these two 
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Fig. 3. Block diagram of the iterative reconstruction method. The Mask is 
an appropriate filter with coefficients of I's and O's depending on the type of 
sparsity in the original signal. 
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TABLE III 

The ITERATIVE ALGORITHM BASED ON THE BLOCK DIAGRAM OF FigEI Fig. 5. The Iterative Method with Adaptive Thresholding (IMAT) for 

detecting the number, location, and values of sparsity. 



Convert the input to the i^^ iteration (x^*)) into the 
transform domain (for instance the Fourier domain); 
x^^^ is normally the initial received signal. 
Multiply the transformed signal (X^*)) by a mask (for 
instance a band-limiting filter). 

Take the inverse transform of the result in step 2 to 



getr(^). 

Set the new result as: x(^+^) = x(°) + x^*) 
Repeat for a given number of iterations. 
Stop when ||x(*+^) - x(^)||£2 < ^• 



domains (Fig. [5] and Table [lll| or consecutive Projections Onto 
Convex Sets (POCS) should yield the original signal [45], [54], 
[55], [58], [61], [62], [63], [64]. 

In case of the usual assumption that the sparsity is in the 
"frequency" domain and for the uniform sampling case of 
lowpass signals, one projection (bandlimiting in the frequency 
domain) suffices. However, if the frequency sparsity is random, 
the time samples are nonuniform, or the "frequency" domain is 
defined in a domain other than the DFT, then we need several 
iterations to have a good replica of the original signal. In 
general, this iterative method converges if the "Nyquist" rate 
is satisfied, i.e., the number of samples per block is greater 
than or equal to the number of coefficients. Figure (H shows 
the improvement in dB versus the number of iterations for 
a random sampling set for a bandpass signal. In this figure. 
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Fig. 4. SNR improvement vs. the no. of iterations for a random sampling 
set at the Nyquist rate (OSR=l) for a bandpass signal. 



besides the standard iterative method, accelerated iterations 
such as Chebyshev and Conjugate Gradient methods are also 
used (please see Appendix U for the algorithms) [65] . 

Iterative methods are quite robust against quantization 
and additive noise. In fact, we can prove that the iterative 
methods approach the pseudo-inverse (least squares) solution 
for a noisy environment; specially, when the matrix is ill- 
conditioned [44]. 

Iterative Method with Adaptive Threshold (IMAT) for Un- 
known Location of Sparsity: As expected, when sparsity is 
assumed to be random, further signal processing is needed. We 
need to evaluate the number of sparse coefficients (or samples), 
the position of sparsity, and the values of the coefficients. 
The above iterative method cannot work since projection (the 
masking operation in Fig. [3]) onto the "frequency" domain is 
not possible without the knowledge of the positions of sparse 
coefficients. In this scenario, we need to use the knowledge 
of sparsity in some way. The introduction of an adaptive 
nonlinear threshold in the iterative method can do the trick, 
thus the name: Iterative Method with Adaptive Threshold 
(IMAT); the block diagram and the algorithm are depicted 
in Fig. [5] and Table [iVl respectively. The algorithms in [66], 
[32], [26], [24] are variations of this method. Figure [5] shows 
that by alternate projections between information and sparsity 
domains (adaptively lowering or raising the threshold levels 
in the sparsity domain), the sparse coefficients are gradually 
picked up after several iterations. This method can be consid- 
ered as a modified version of Matching Pursuit as described in 
Section IVI-D.li the results are shown in Fig.[6l The sampling 
rate in the time domain is twice the number of unknown sparse 
coefficients. This is called the full capacity rate; this figure 
shows that after about 15 iterations, the SNR reaches its peak 
value. In general, the higher the sampling rate relative to the 
full capacity, the faster is the convergence rate and the better 
is the SNR value. 

Matrix Solutions: When the sparse nonzero locations are 
known, matrix approaches can be utilized to determine the 
values of sparse coefficients [52]. Although these approaches 
are rather straight forward, they may not be robust against 
quantization or additive noise. 

There are other approaches such as Spline interpolation [51], 
nonlinear/time varying methods [52], Lagrange interpolation 
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TABLE IV 

Generic IMAT of Fig.[5]for any sparsityin the Discrete 
Transform (DT), which is typically the Fast Fourier 
Transform (FFT). 
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Fig. 6. SNR vs. the no. of iterations for sparse signal recovery using the 
IMAT (Table Irvl. 

[49] and Error Locator Polynomial (ELP) [67] that will not be 
discussed here. These methods work quite well in the absence 
of additive noise but they may not be robust in the presence 
of noise. However the ELP approach will be discussed in Sec. 
IIILAi variations of this method are called the annihilating filter 
in sampling with finite rate of innovation (Sec. IILCI) and the 
Prony's method in spectral and DOA estimation (Sec. lIV-Ab . 

B. Compressed Sensing (CS) 

The relatively new topic of Compressed (Compressive) 
Sensing (CS) which was originally introduced in [30] and 
further extended in [68], [69] and [31] deals with sampling of 
sparse signals, in general. The idea is to introduce sampling 
schemes with low number of required samples which uniquely 
represent the original sparse signal; these methods have lower 
computational complexities than the traditional techniques 
that employ oversampling and then apply compression. In 
other words, compression is achieved exactly at the time of 
sampling. Unlike the classical sampling theorem [70], the 
signals are assumed to be sparse in an arbitrary transform 
domain, not necessarily the Fourier transform. Furthermore, 
there is no restricting assumption for the location of nonzero 
coefficients in the sparsity domain; i.e., the locations should 
not follow a specific pattern such as lowpass or multiband 



structure. Clearly, this assumption includes a more general 
class of signals than the ones previously studied. 

Since the concept of sparsity in a transform domain is easier 
to study for discrete signals, most of the research in this field 
is focused along discrete type signals [71]; however, recent 
results [72] show that most of the work can be generalized to 
continuous signals that have a sparse representation in a Riesz 
basi^ [73]. We first study discrete signals and then briefly 
discuss the extension to the continuous case. 

1) CS Mathematical Modeling: Let the vector x G M"^ be a 
finite length discrete signal in the time domain which has to be 
under-sampled. We assume that x has a sparse representation 
in a transform domain denoted by a unitary matrix ^nxn; i-^., 
we have: 

X = * • s (2) 

where s is an n x 1 vector which has at most k non-zero 
elements (/c-sparse vectors). In practical cases, s has at most 
k significant elements and the insignificant elements are set 
to zero which means s is an almost /c-sparse vector. For 
example, x can be the pixels of an image and ^ can be the 
corresponding IDCT matrix. In this case, most of the DCT 
coefficients are insignificant and if they are set to zero, the 
quality of the image will not degrade significantly. In fact, this 
is the main concept behind some of the lossy compression 
methods such as JPEG2000. Since the inverse transform on 
X yields s, the vector s can be used instead of x, which 
can be succinctly represented by the location and values of 
the nonzero elements of s. Although this method efficiently 
compresses x, it initially requires all the samples of x to 
produce s, which undermines the whole purpose of CS. 

Now let us assume that instead of samples of x, we take 
m linear combinations of the samples (called generalized 
samples). If we represent these linear combinations by the 
matrix ^mxn and the resultant vector of samples by ymxi, 
we have: 

Ymxl = ^mxn ' ^nxl = ^mxn ' ^nxn ' ^nxl (3) 

The question is how the matrix $ and the size m should 
be chosen to ensure that these samples uniquely represent the 
original signal x. Apparently, the case of $ = 1^ x n for m = n 
yields a trivial solution (keeping all the samples of x) that 
does not employ the sparsity characteristic. We look for $ 
matrices with as few rows as possible which can guarantee 
the invertibility of the sampling process for the class of sparse 
inputs. 

To solve this problem, we introduce probabilistic measures; 
i.e., instead of exact recovery of signals, we focus on the 
probability that a random sparse signal (according to a given 
probability density function) fails to be reconstructed using its 
generalized samples. If the probability of failure approaches 

^The sequence of vectors {vn} is called a Riesz basis if there exist scalars 
Q < A < B < oo such that for every absolutely summable sequence of 
scalars {an}, we have the following inequalities: 



1) Use the all-zero block as the initial value of the sparse 
domain signal (0*^ iteration) 

2) Convert the current estimate of the signal in the sparse 
domain into the information domain (for instance the 
time domain into the Fourier domain) 

3) Where possible, replace the values with the known 
samples of the signal in the information domain. 

4) Convert the signal back to the sparse domain. 

5) Use adaptive hard thresholding to distinguish the orig- 
inal nonzero samples. 

6) If neither the maximum number of iterations has past 
nor a given stopping condition is fulfilled, return to the 
2nd step. 
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zero, we can state that the sampHng scheme (the joint pair of 
is successful in recovering x with probabiHty 1. 
Let us assume that represents the first m rows of 

an invertible matrix ^nxn- It is apparent that if we use 
|$(^)}^^^ as the sampHng matrices for a given sparsity 
domain, the failure probabilities for and ^'^^^ are one 
and zero respectively, and as the index m increases, the 
failure probability decreases. The important point is that the 
decreasing rate of the failure probability is exponential with 
respect to ^ [74]. Therefore, we expect to reach an almost 
zero failure probability much earlier than m = n despite the 
fact that the exact rate highly depends on the mutual behavior 
of the two matrices More precisely, it is shown in [74] 

that: 

c rn 

Pfailure<n-e .H^M^W^ (4) 

where c is a positive constant and ^"^^^ ) is the maximum 
coherence between the rows of * and defined by [75]: 

= V^- max |(V'a,'^6)| (5) 

l<a<n,l<6<m 

where V^a, </>6 are the a^^ and h^^ rows of the matrices ^ and 
^, respectively. The above result implies that, the probability 
of reconstruction is close to one for: 

9, ^ ^ , /c • Inn 

m> (6) 

V ' c 

The above derivation implies that, the lower the maximum 
coherence between the two matrices, the lower is the number 
of required samples. Thus, to decrease the number of samples, 
we should look for matrices $ with low coherence with ^. 
For this purpose, we use a random It is shown that the 
coherence of a random matrix with i.i.d. Gaussian distribution 
with any unitary ^ is considerably small [30], which makes 
it a proper candidate for the sampling matrix. Investigation on 
the probability distribution has shown that the Gaussian PDF 
is not the only solution (for example a binary distribution is 
considered in [76]) but may be the simplest to analyze. 

The inequality shown in Q can be simplified in terms 
of m,n and k\ it can be shown [3] and [71] that a sparse 
signal can be reconstructed from its compressed samples with 
a probability of almost one if: 

Tl 

m> c k\og( — ) (7) 
k 

2) Reconstruction from Compressed Measurements: In this 
subsection, we would like to consider the reconstruction algo- 
rithms and stability issues. Essentially, there are three methods: 
A- Geometric, B- Combinatorial, and C- Information Theo- 
retic. We would like to briefly discuss these three methods. 

Geometric Methods: The oldest methods for reconstruction 
from compressed sampling are geometric, i.e., ii minimization 
techniques for finding a /c-sparse vector s G from a set of 
m = 0(/clog(^)) measurements (i/i); see e.g., [30], [74], 
[77], [78], [79]. Let us assume that we have applied a suitable 
^ which guarantees the invertibility of the sampling process. 
The reconstruction method should be a technique to recover 
a /c-sparse vector s^xi from the observed samples ymxi = 




No. of sparse coefficients (1^) 

Fig. 7. Relation between m and k for sparse DFT and DCT signals; the 
frame size is n = 2^^. 

or possibly Ymxi = 
where u denotes the noise vector. Suitability of $ implies that 
Snxi is the only /c-sparse vector that produces the observed 
samples; therefore, s^xi is also the sparsest solution for y = 
$ • ^ • s. Consequently, s can be found using: 

minimize subject to y = $ • ^ • s (8) 

Minimization with respect to ^o-norm (sparsity) is an NP- 
complete problem in general. However, it is shown in [80] 
that minimization with ^i-norm results in the same vector s 
for many cases. The interesting part is that the number of 
required samples to replace £o with -minimization has the 
same order of magnitude as the one for the invertibility of the 
sampling scheme. Hence, s can be derived from ([5]) using ii- 
minimization. It is worthwhile to mention that replacement of 
^i-norm with ^2 -norm, which is faster to implement, does not 
necessarily produce reasonable solutions. However, there are 
greedy methods (Matching Pursuit as discussed in Sec. IVllon 
SCA [81], [82]) which iteratively approach the best solution 
faster than ^i-norm optimization (Basis Pursuit as discussed 
in SeclVHon SCA). 

The technique of IMAT, discussed in random sampling and 
simulated in Fig. [6] for sparse DFT signals, can also be used 
for the recovery of a sparse signal in other transform domains 
such as DCT. It should be noted that this sampling process 
is a special case of CS. Instead of a linear combination of 
samples, the actual random samples are used [27]. Simulation 
results shown in Fig. [7] confirm the relation represented in dT]). 
In our simulation results, perfect reconstruction corresponds to 
an SNR value of 100 dB with a rehability of at least 80%. It is 
interesting to see that (|7]) closely tracks the DFT sparse signal 
for a range of k when c = 1. When k increases for a given n 
(i.e., the signal becomes less sparse), the relative number of 
needed measurements to specify the signal is decreased and 
the relation (|7]) is no longer valid. 

A sufficient condition for these methods to work is that the 
matrix $ • ^ must satisfy the so called Restricted Isometric 
Property (RIP) [83], [84], [76]; which will be discussed in the 
following subsection: 

RIP: It is important that in the presence of noise, the 
algorithms produce the best approximate solution to within 
a precision; in other words, small perturbations in the signal 
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caused by noise result in small distortions in the output 
solution. This characteristic is usually defined as stability. 
In compressed sensing, the stability of the reconstruction is 
determined by the characteristics of the sampling matrix 
We say that the matrix $ has RIP of order k, when for all 
/c-sparse vectors s, we have [31]: 



1-4 < 



l^-s|| 



< 1 + 4 



(9) 



where < 4 < 1 (isometry constant). The RIP is a sufficient 
condition that provides us with the maximum and minimum 
power of the samples with respect to the input power and 
ensures that none of the /c-sparse inputs fall in the null space 
of the sampling matrix. The RIP property essentially states 
that every 4/c columns of the matrix ^mxn must be almost 
orthonormal. The explicit construction of a matrix with such 
a property is difficult for any given n m; however, the 
problem has been studied in some cases [41], [85]. Moreover, 
given such a matrix finding s (or alternatively x) via the 
minimization problem involves linear programming with n 
variables and m constraints which can be computationally 
expensive. 

Among the matrices that satisfy the RIP condition are 
Gaussian random matrices. If $ is a Gaussian random matrix 
with the number of rows satisfying (|6]), is also a Gaussian 
random matrix with the same number of rows and thus it 
satisfies RIP, which guarantees a stable recovery. Assume that 
instead of ^ • s, we have $ • s + i/, where v represents the 
additive noise vector. Since $ • s + i/ may not belong to the 
range space of $ over /c-sparse vectors, the £i minimization of 
([5]) may not produce a solution. Thus we employ the following 
minimization instead: 

minimize \\s\\£^ subject to ||y — $ • s||^2 < ^ (1^) 

where is the maximum noise power. Let us denote the result 
of the above minimization for y = $- s + z/bys, which is 
also a /c-sparse vector. Now we have: 

||y-^-s||^, = ||$.(s-s)+i/||^, <6 

^||^-(S-S)||,, <||l.||,,+6 (11) 

Since both s and s are /c-sparse, s — s is 2/c-sparse, and by 
using the RIP, we get: 

(1 - 52km - 4i2 < 11^ • (s - < HU, + e (12) 
Or equivalently, 

lliylL 

(13) 



s-s < 



1 - 4fe 



This shows that small perturbations in the input cause small 
perturbations in the output (stability). Moreover, as ap- 
proaches unity, the distortion in the output caused by the input 
additive noise becomes more significant; the ideal case is when 
S2k = 0. 



Combinatorial: Another standard approach for reconstruc- 
tion of compressed sampling is combinatorial. The sampling 
matrix $ is found using bipartite graphs, and consists of 
binary entries, i.e., entries that are either 1 or 0. Binary search 
methods are then used to find an unknown /c-sparse vector 
s e M^, see e.g., [77], [86], [87], [88], [89], [90], [91], [92] 
and the references therein. Typically, the binary matrix $ has 
m = 0{klogn) rows, and there exist fast algorithms for 
finding the solution x from the m measurements (typically 
a linear combination). However, the construction of $ is also 
difficult. 

Information Theoretic: A more recent approach is adaptive 
and information theoretic [93]. In this method, the signal s G 
is assumed to be an instance of a vector random variable 
X = (xi,...,Xn)\ and the i^^ row of $ is constructed 
using the value of the previous sample i/i-i. Tools from the 
theory of Huffman coding are used to develop a deterministic 
construction of a sequence of binary sampling vectors (j)a- (i.e., 
the components of (j)a- consist of or 1) in such a way as to 
minimize the average number of sampling (rows of $) needed 
to determine a signal. 

3) Almost Sparse Signals and Noisy Measurements: An 
important issue in compressed sampling is the robustness to 
noisy measurements. Specifically, the reconstruction of a k- 
sparse vector s from noisy i/i^ = -\-r]i should produce 
a /c-sparse vector s, which is close to s. 

Another important aspect is the behavior of compressive 
sampling algorithms on almost sparse signals which are more 
likely to occur in applications than exactly /c-sparse vectors. 
For example a /c-sparse vector s may be corrupted by noise 
T] eW^, producing the vector s = s + r/. Another example is 
the wavelet transform of an image which consists mostly of 
small coefficients and a few large coefficients. Obviously, any 
method for the sampling and reconstruction of sparse signals 
must also be well adapted to almost sparse signals, i.e., if a 
sampling and reconstruction method is applied to an almost 
/c-sparse signal s, it must produce a /c-sparse signal s that 
includes the k most significant coefficients of s (up a to small 
error). 

4) CS for Analog Signals: Recently, there have been efforts 
to extend the concept of CS to analog signals [72]. The sparse 
signals are assumed to be the elements of a Shift-Invariant 
(SI) space generated by n kernels with period T. For instance, 
assume {ai{t)}]L^ form a Riesz basis (see the footnote in page 
[6]) [73] for 1/2 ; the respective generated SI space is 

n 

5/ = •a;(i-fcT) | di[k]&R 

1=1 fcez 

, f]^||d,[fc]f <oo} (14) 

For example, the set of all lowpass signals with bandwidth 
B form an SI space with a single kernel (sinc{2Bt)). Simi- 
larly, the set of multiband signals with the frequency support 
defined as n fixed disjoint intervals of equal length form an n- 
kernel SI space. The classical sampling theorems suggest that 
the sampling rate of ^ is sufficient for the signal recovery for 
the space given in (fT^I) . 
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Now, a signal x{t) G SI is called /c-sparse if in its 
basis representation, at most k generators among the total 
n are active; i.e., di[k] = for / ^ {h^hj - - - jh} where 
{h^h^ • ' • ^h} is an arbitrary subset of {1, 2, . . . , n}. Similar 
to the discrete case, we are looking for sampling schemes that 
employ the inherent sparsity in order to decrease the sampling 
rate. In [72], it is shown that the rate could be as low as ^ 
for /c-sparse signals; this is the theoretical lower bound for 
invertible sampling rates. Instead of the sampling matrix a 
filter bank is used for sampling; the signal is passed through 
p filters (2k < p < n) followed by samplers, each sampling at 
rate ^. The analog continuous signal x{t) is thus transformed 
to an infinite length vector; generalization of the CS for finite 
length vectors has also been studied in [72]. 



C. Sampling with Finite Rate of Innovation 
The classical sampling theorem states that: 



(15) 



where B is the bandwidth of x{t) with the Nyquist interval 
Ts = These uniform samples can be regarded as the 
degrees of freedom of the signal; i.e., a lowpass signal with 
bandwidth B has one degree of freedom in each Nyquist 
interval Tg. Replacing the sine function with other kernels 
in ([T5l) , we can generalize the sparsity (bandlimitedness) in 
the Fourier domain to a wider class of signals known as the 
SI spaces: 

x(t) = ^Q-(^(^-z) (16) 



Similarly, the above signals have one degree of freedom in 
each Ts period of time (the coefficients Ci). A more general 
definition for the degree of freedom is introduced in [4] and 
is named the Rate of Innovation. For a given signal model, 
if we denote the degree of freedom in the time interval of 
[ti^h] by Cx{ti^t2), the local rate of innovation is defined by 
t2-ti ^xiti^h) and the global rate of innovation (p) is defined 
as 



lim -^Cx{t -r.t^r) 

r^co Zt 



(17) 



provided that the limit exists; in this case we say that the 
signal has finite rate of innovation [4], [28], [94], [95]. As an 
example, for the lowpass signals with bandwidth B we have 
p = 2B, which is the same as the Nyquist rate. In fact by 
proper choice of the sampling process, we are extracting the 
innovations of the signal. Now the question that arises is that 
whether the uniform sampling theorems can be generalized 
to the signals with finite rate of innovation. The answer is 
positive for a class of non-bandlimited signals including the 
SI spaces. Consider the following signals: 

R 



X{t) = ^^Q,r • 



t-ti 



(18) 



where {(pr{t)}^^i are arbitrary but known functions and 
{ti}iez is a realization of a point process with mean /i. The 





<p(t) 
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Fig. 8. Sampling with the kernel Lp(t) 



free parameters of the above signal model are {ci^r} and {ti}. 
Therefore, for this class of signals we have P = however, 
the classical sampling methods cannot reconstruct these kinds 
of signals with the sampling rate predicted by p. There are 
many variations for the possible choices of the functions (pr{t)', 
nonetheless, we just describe the simplest version. Let the 
signal x{t) be a finite mixture of sparse Dirac functions: 



x{t) =^Ci •5{t-ti) 



(19) 



where {ti} is assumed to be an increasing sequence. We intend 
to show that the samples generated by proper sampling kernels 
(^{t) (shown in Fig. [S]) can be used to reconstruct the sparse 
Dirac functions. In fact we choose the kernel ip{t) to satisfy 
the so called Strang-Fix condition of order 2k'. 

V0<r<2fc-1, 3 {ar^i}iez- 

^ar^iif{t-i) =f (20) 

The above condition for the Fourier domain becomes: 
<I>(1] = 0) 7^0 

<l>(^) = 2iTi) =0, V z 7^ G Z (21) 
r = 0, ...,2A:-1 

where ^{^) denotes the Fourier transform of (^{t), and the 
superscript (r) represents the r^^ derivative. It is also shown 
that such functions are of the form ip{t) = f{t) * P2k{t), 
where P2k{t) is the B-spline of order 2k^^ and f{t) is an 
arbitrary function with nonzero DC frequency [94] . Therefore, 
the function (32k{t) is itself among the possible options for the 
choice of (p{t). 

We can show that for the sampling kernels which satisfy the 
Strang-Fix condition (l2Ql) , the innovations of the signal x{t) 
([T9l ) can be extracted from the samples (y[j]): 



y[j] = (^W * 



(22) 



Thus, 



k 



i=i jez 

k 



(23) 



i=l 
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In other words, we have filtered the discrete samples (y[j]) in 
order to obtain the values ; (l23l) shows that these values are 
only a function of the innovation parameters (amplitudes Ci 
and time instants U). However, the values are nonlinearly 
related to the time instants and therefore, the innovations 
cannot be extracted from using linear als ebrfl However, 
these nonlinear equations form a well-known system which 
was studied by Prony in the field of spectral estimation (see 
Sec. lIV-Ab and its discrete version is also employed in both 
real and Galois field versions of Reed-Solomon codes (see Sec. 
IIII-AI) . This method which is called the annihilating filter is 
as follows: 

The sequence {rr} can be viewed as the solution of a 
recursive equation. In fact if we define H{z) = Yl^=o — 
Yl^^i{z — ti), we will have (see Sec. IIII-AI and Appendices 
[nl [nil for the proof of a similar theorem): 



Vr : 



(24) 



In order to find the time instants ti, we find the polynomial 
H{z) (or the coefficients hi) and we look for its roots. A 
recursive relation for becomes: 



n r2 



Tk 

^k+1 
T2k-1 





' hi ' 












T/e+2 




_ 




_ T2k 



(25) 



By solving the above linear system of equations, we obtain 
coefficients hi (for a discussion on invertibility of the left 
side matrix see [94], [96]) and consequently, by finding the 
root of H{z), the time instants will be revealed. It should 
be mentioned that the choice of ri,...,r2/c in ([25]), can 
be replaced with any 2k consecutive terms of {ri}. After 
determining {ti}, (|23]) becomes a linear system of equations 
with respect to the values {ci} which could be easily solved. 

This reconstruction method can be used for other types 
of signals satisfying ([TSl) such as the signals represented by 
piecewise polynomials [94]. An important issue in nonlinear 
reconstruction is the noise analysis; for the purpose of de- 
noising and performance under additive noise the reader is 
encouraged to see [28]. 

III. Error Correction Codes: Galois and 
Real/Complex Fields 

The relation between sampling and channel coding is the 
result of the fact that over-sampling creates redundancy [97]. 
This redundancy can be used to correct for "sparse" impulsive 
noise. Normally, the channel encoding is performed in finite 
Galois fields as opposed to real/complex fields; The reason is 
the simplicity of logic circuit implementation and insensitivity 
to the pattern of errors. On the other hand, the real/complex 
field implementation of error correction codes has stability 
problems with respect to the pattern of impulsive, quantization 

^Note that the Strang-Fix condition can be also used for an exponential 
polynomial assuming the delta functions are non-uniformly periodic; in that 
case Tr in equation (|23) is similar to E, the DFT of the impulses, as defined 
in Appendices [III and |lll] 



and additive noise [46], [53], [67], [98], [99], [100], [101]. 
Nevertheless, such implementation has found applications in 
fault tolerant computer systems [102], [103], [104], [105], 
[106] and impulsive noise removal from 1-D and 2-D signals 
[32], [33]. Similar to finite Galois fields, real/complex field 
codes can be implemented in both block and convolutional 
fashions. 

A discrete real-field bloclc code is an oversampled signal 
with n samples such that, in the transform domain (e.g., DFT), 
a contiguous number of high frequency components are zero. 
In general, the zeros do not have to be the high frequency 
components or contiguous. However, if they are contiguous, 
the resultant m equations (from the syndrome information do- 
main) and m unknown erasures form a Vandermonde matrix, 
which ensures invertibility and consequently erasure recovery. 
The DFT block codes are thus a special case of Reed-Solomon 
(RS) codes in the field of real/complex numbers [97]. A more 
general condition to have a Vandermonde matrix is that the 
indices of the zero frequencies form an arithmetic progression 
using mod{n). This is equivalent to having contiguous zeros 
in the Sorted DFT (SDFtQ) domain [2], [53], [107]. 

Figure [9] represents a convolutional encoder of rate | of 
finite constraint length [97] and infinite precision per symbol. 
Fig. Oa) is a systematic convolutional encoder and resembles 
an oversampled signal discussed in Section [III if the FIR 
filter acts as an ideal lowpass filter. Fig. [3b) is a non- 
systematic encoder used in the simulations to be discussed 
subsequently. In case of additive impulsive noise, errors could 
be detected based on the side information that there are 
frequency gaps in the original oversampled signal (syndrome). 
In the following subsections, various algorithms for decoding 
along with simulation results are given for both block and 
convolutional codes. Some of these algorithms can be used in 
other applications such as spectral and channel estimation. 

A. Decoding of Block Codes- ELP Method 

Iterative reconstruction for an erasure channel is identical 
to the missing sampling problem [108] discussed in Section 
III-A.ll and therefore, will not be discussed here. Let us 
assume that we have a finite discrete signal Xorig[i], where 
i = 1,...,/. The DFT of this sequence yields / complex 
coefficients in the frequency domain (Xorig[j]^ j = 1, . . . , /)• 
If we insert p consecutive zero^ to get n = / + p samples 
(X[j], j = 1, . . . ,n) and take its inverse DFT, we end up 
with an oversampled version of the original signal with n 
complex samples (x[z], z = 1, . . . , n). This oversampled signal 
is real if Hermitian symmetry (complex conjugate symmetry) 
is preserved in the frequency domain, e.g., the set 9 of p 
zeros are centered at ^. For erasure channels, the sparse 
missing samples are denoted by e[im] = ^[^m], where i^'s 
denote the positions of the lost samples; consequently, for 
i 7^ im, = 0. The Fourier transform of e[i] (called 

^The kernel of SDFT is exp (^^i q), where q is relatively prime w.r.t. n; 
this is equivalent to a sorted version of DFT coefficients according to a mod 
rule. 

^We call the set of indices of consecutive zeros syndrome positions and 
denote it by B; this set includes the complex conjugate part in a block of n 
samples. 
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Input Sequence at 
Nyquist rate 



(b) 



Fig. 9. Convolutional Encoders (a) A real-field systematic convolutional 
encoder of rate ^; /i[i]'s are the taps of an FIR filter; (b) A non-systematic 
convolutional encoder of rate |, /ii[i]'s and /i2[i]'s are the taps of 2 FIR 
filters. 



E[j]^ j = 1, . . . , n) is known for the syndrome positions 9. 
The remaining values of E[j] can be found from the following 
recursion (see Appendix |Il|): 



k 

E[r] = --^^£:[r + t]/ifc_t 



(26) 



t=l 



where r is a member of the complement of 6 and the 
index additions are in mod{n). After finding E[j] values, the 
spectrum of the recovered oversampled signal X[j] can be 
found by removing E[j] from the received signal (see (l97l) 
in Appendix HD). Hence the original signal can be recovered 
by removing the inserted zeros at the syndrome positions of 
X[j]. The above algorithm, called Error Locator Polynomial 
(ELP) algorithm, is capable of correcting any combination 
of erasures. However, if the erasures are bursty, the above 
algorithm may become unstable. To combat bursty erasures, 
we can use the SDFT [2], [53], [109], [110], [107] instead 
of DPT. The simulation results for block codes with erasure 
and impulsive noise channels are given in the following two 
subsections. 

1 ) Simulation Results for Erasure Channels: The simula- 
tion results for the ELP decoding implementation for n = 32, 
p = 16, and k = 16 erasures (a burst of 16 consecutive missing 
samples from position 1 to 16) are shown in Fig. [TOl 

Since consecutive sample losses represent the worst case 
[53], [110], the proposed method works better for random 
samples. In practice, the error recovery capability of this 
technique degrades with the increase of the block and/or burst 
size due to the accumulation of round-off errors. In order to 
reduce the round-off error, instead of the DFT, a transform 
based on the SDFT, or Sorted DCT (SDCT) can be used [2], 
[53], [1 10]. These types of transformations act as an interleaver 
to break down the bursty erasures. 

2 ) Simulation Results for Random Impulsive Noise Chan- 
nel: There are several methods to determine the number. 
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Num. of Erasures as the percentage of Capacity 

Fig. 10. Recovery of a burst of 16 sample losses. 



locations and values of the impulsive noise samples, namely: 
Modified Berlekamp-Massey for real fields [111], [112], ELP, 
IMAT, and CFAR-RDE. The Berlekamp-Massey method for 
real numbers is sensitive to noise and will not be discussed 
here [111]. The ELP method is described in the next subsec- 
tion, while the other two methods are discussed below. 

ELP Method [96]: When the number and positions of the 
impulsive noise samples are not known, ht in ([26l) is not 
known for any t; therefore, we assume the maximum possible 
number of impulsive noise samples per block, i.e., k = [^^J 
as given in ([94l) in Appendix [III To solve for ht, we need 
to know only n — I samples of E in the positions where 
zeros are added in the encoding procedure. Once the values of 
ht are determined from the pseudo-inverse [96], the number 
and positions of impulsive noise can be found from (l96b in 
Appendix HH The actual values of the impulsive noise can be 
determined from (l26l) as in the erasure channel case. For the 
actual algorithm, please refer to Appendix Jill As we are using 
the above method in the field of real numbers, exact zeros 
of {Hk}, which are the DFT of {hi}, are rarely observed; 
consequently, the zeros can be found by thresholding the 
magnitudes of H^. Alternatively, the magnitudes of Hk can 
be used as a mask for soft-decision; in this case, thresholding 
is not needed. 

CFAR-RDE and IMAT Methods [32]: The Constant False 
Alarm Rate with Recursive Detection Estimation (CFAR- 
RDE) method is similar to the Iteration Method with Adaptive 
Thresholding (IMAT) with the additional inclusion of the 
CFAR module to estimate the impulsive noise; CFAR is exten- 
sively used in radars to detect and remove clutter noise from 
data. In CFAR, we compare the noisy signal with its neighbors 
and determine if an impulsive (sparse) noise is present or not 
(using soft decision jl. After removing the impulsive noise in 
a "soft" fashion, we estimate the signal using the iterative 
method for an erasure channel as described in Section III-A.ll 
for random sampling or using the ELP method. The impulsive 
noise and signal detection and estimation go through several 
iterations in a recursive fashion as shown in Fig. [TT] As 
the number of recursions increases, the certainty about the 
detection of impulsive noise locations also increases; thus, the 
soft decision is designed to act more like the hard decision 

^This has some resemblance to soft decision iteration for turbo codes[101]. 
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Fig. 11. CFAR-RDE method with the use of adaptive soft thresholding and an iterative method for signal reconstruction. 
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Fig. 12. Comparison of CFAR-RDE and a simple soft decision RDE for 
DFT block codes. 

at the latter parts of the iteration steps, which yields the 
error locations. Meanwhile, further iterations are performed to 
enhance the quality of the original signal since suppression of 
the impulsive noise also suppresses the original signal samples 
at the location of the impulsive noise. The improvement of 
using CFAR-RDE over a simple soft decision RDE is shown 
in Fig. M 

B. Decoding for Convolutional Codes 

The performance of convolutional decoders depends on 
the coding rate, the number and values of FIR taps for 
the encoders, and the type of the decoder. Let us take the 
convolutional encoder of rate ^ of Fig. Ob) as our platform 
for simulations. For simulation results, the taps of the encoder 
of Fig. Ob) are 

/ii = [1,2,3,4,5,16], 

/i2 = [16,5,4,3,2,1] (27) 

The input signal is taken from a uniform random distribution 
of size 50 and the simulations are run 1000 times and then 
averaged. The following subsections describe the simulation 
results for erasure and impulsive noise channels. 

1 ) Decoding for Erasure Channels: For the erasure chan- 
nels, we employ two methods as described below: 



Iterations with Averaging: An iterative method to decode 
for erasures in the convolutional code of Fig. Ob) is shown in 
Fig. \T3\ This figure is designed for the rate ^ convolutional 
encoder. At each stage of decoding, the results of the two 
branches are averaged. For the rate ^ and specific FIR struc- 
ture, the SNR improvement versus the relative rate of erasures 
with respect to the theoretical maximum rate of correction 
capability (full capacity) is shown in Fig. [141 This figure shows 
that the SNR values gradually decrease as the sampling rate 
reaches the full capacity. 

Decoding Using the Generator Matrix: The generator ma- 
trix of a convolutional encoder of the type depicted in Fig. 
Ob) with taps given in (l27l) can be shown to be [5] 



1 














16 














2 


1 











5 


16 











3 


2 


1 








4 


5 


16 








4 


3 


2 


1 





3 


4 


5 


16 





5 


4 


3 


2 


1 


2 


3 


4 


5 


16 


16 


5 


4 


3 


2 


1 


2 


3 


4 


5 





16 


5 


4 


3 





1 


2 


3 


4 








16 


5 


4 



An iterative decoding scheme for this matrix representation 
is similar to that of Fig. [3] except that the operator G consists 
of the generator matrix, a mask (erasure operation) and the 
transpose of the generator matrix. If the rate of erasure does 
not exceed the encoder full capacity, the matrix form of 
the operator G can be shown to be a nonnegative definite 
square matrix and therefore its inverse exists [1], [45]. Thus, 
with a proper choice of the relaxation parameter, the iteration 
represented in Fig. [13] converges to the actual signal. 

By using the above operator G in our iterative simulations. 
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Fig. 13. Iterative decoding for a rate ^ convolutional encoder as shown in Fig. [9lb). 




Fig. 14. SNR vs. the percentage of erasure for the convolutional decoder of 
Fig. [bI after 50 iterations. 




Fig. 15. Simulation results of a convolutional decoder, using the iterative 
method with the generator matrix, after 30 CG iterations (see Appendix H); 
SNR vs. the relative rate of erasures (w.r.t. full capacity) in an erasure channel. 



better results can be obtained in comparison with the averaging 
method of Fig. [TH Figure [15] shows that the SNR values 
gradually decrease as the rate of erasure reaches its maxi- 
mum (capacity). This figure shows that the generator matrix 
approach of decoding using the iteration matrix performs much 
better than the averaging method represented in Figs. [13] and 
[T4l However, the complexity of the matrix approach is higher 
than the averaging method. 



2) Decoding for Impulsive Noise Channels: Let us consider 
X and y as the input and the output streams of the encoder, 
respectively, related to each other through the generator matrix 
G as y = Gx. 

Denoting the observation vector at the receiver by y, we 
have y = y + i/, where v is the impulsive noise vector. 
Multiplying y by the transpose of the parity check matrix 
H^, we get 



H^y = H^(y + z/) = H^Gx + H^z/ 



H^y = ITu 



(29) 



Multiplying the resultant by the right pseudo-inverse of the 
H^, we derive: 

H(H^H)-^H^y = H(H^H)-^H^z/ 

= z> (30) 



by 
the 



Thus by multiplying the received vector 
H(H^H)~^H^, we obtain an approximation of 
impulsive noise. In the IMAT method, we apply the operator 
H(H^H)"^H^ in the iteration of Fig. \5\ the threshold level 
is reduced exponentially at each iteration step. The block 
diagram of IMAT in Fig. [5] is modified as shown in Fig. [16] 

For simulation results, we use the generator matrix shown 
in ([28]) : its generator matrix can be calculated from [5] and is 
given below: 
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(31) 



In our simulations, the locations of the impulsive noise 
samples are generated randomly and their amplitudes have 
Gaussian distributions with zero mean and variance equal to 
1, 2, 5 and 10 times the variance of the encoder output. The 
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Fig. 16. The modified diagram of IMAT method from Fig. |5] 
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Fig. 17. Simulation results by using the IMAT method for detecting the 
location and amplitude of the impulsive noise, A = 1.9. 



results are shown in Fig. [17] after 300 iterations. This figure 
shows that the high variance impulsive noise has a better 
performance. 

C. LDPC Codes 

Low Density Parity Check (LDPC) codes are another 
example of the application of sparse matrices in reducing 
complexity [113], [114], [115], [116]. LDPC codes are linear 
block codes whose parity check matrix, H is spars43- The 
term "low-density" refers to this fact. A code with a sparse 
H matrix which has equal number of I's in each row and 
equal number of I's in each column is called a regular LDPC 
code. The iterative decoding algorithm that is used to decode 
a given code, with either a high or low density of ones in the 
H matrix, is such that the decoding complexity grows almost 
linearly with the block length. The coefficient of this increase 
is directly dependent on the number of I's in the H matrix as 
is explained below. 

For decoding LDPC codes, we use Tanner graphs [117]; 
these graphs may also be useful in decoding real-field codes 
when the parity matrix is sparse and possibly binary. Each 
iteration module of the decoding algorithm consists of two 
steps. In the first step, for each row, certain operations are 
performed for each 1 on that row. Similarly for each column, 
certain operations are performed for each 1 on that column in 
the second step. Therefore, the sparser the matrix is, the lower 
the decoding complexity will be. 

As an example, consider an H matrix of a regular LDPC 
code which has 6 and 3 ones in its rows and columns, 
respectively. Consider another code whose H matrix has 
similarly 8 and 4 ones. The decoding complexity of the former 
code is 75% of the latter. Note that in general (assuming that 

^^Note that the generator matrix G of an LDPC code is not necessarily 
sparse and the above discussions are only valid for decoding and not encoding. 



H matrix has full rank), for a given code rate R, the ratio of 
the number of ones in each column to the number of ones in 
each row is a constant which is equal io 1 — R. Therefore, if 
the number of ones in the rows is scaled by a factor of c, the 
number of ones in the columns has to be scaled with the same 
factor to keep the code rate fixed. Consequently, the decoding 
complexity is scaled by c. 

1) LDPC Decoding Using Linear Programming (LP): 
The linear programming decoding of block codes was 
initially suggested by Feldman in 2003 [118], [119]. Consider 
Maximum Likelihood (ML) decoding of block codes; this 
is an optimization problem where we want to minimize a 
cost function by finding a codeword that has the minimum 
Euclidean distance to the received vector from the channel. In 
the logarithm domain, the cost function can be transformed 
into a linear function of the coded bits (see [120]). Note 
that the codeword has to satisfy a linear set of parity check 
equations. Therefore, the problem of decoding a block code 
can be considered as an LP optimization, and the computa- 
tional complexity grows exponentially with the block length. 
However, a sub-optimal solution to this problem significantly 
reduces the complexity [120]. This modification however 
makes the overall performance and complexity of the method 
comparable to that of the conventional decoding method of 
LDPC codes, the Maximum A Posteriori (MAP) decoding on 
a bit level. 

The performance of the LP decoding method is usually 
inferior to that of MAP decoding but it has been shown to be 
superior to that of the MIN-SUM, which is a simplification of 
the original MAP decoding algorithm [121]. 

IV. Spectral Estimation 

In this section, we review some of the methods which 
are used to evaluate the frequency content of data [8], [9], 
[10], [11]. In the field of signal spectrum estimation, there 
are several methods which are appropriate for different types 
of signals. Some methods are known to better estimate the 
spectrum of wideband signals, where some others are proposed 
for the extraction of narrow-band components. Since our focus 
is on sparse signals, it would be reasonable to assume sparsity 
in the frequency domain, i.e., we assume the signal to be a 
combination of several sinusoids plus white noise. 

Conventional methods for spectrum analysis are non- 
parametric methods in the sense that they do not assume any 
model (statistical or deterministic) for the data, except that it 
is zero or periodic outside the observation interval. As a well 
known non-parametric method, we can name the periodogram 
Pperif) that can be computed via the EFT: 



Pperif) 



1 



r=0 



-j27TfrT, 



(32) 



where m is the number of observations, Ts is the sampling 
interval (usually assumed as unity), and Xr is the signal. 
Although non-parametric methods are robust with low compu- 
tational complexity, they suffer from fundamental limitations. 

^^Fo r further discussion on LP and its relation to Basis Pursuit, please refer 
to Sec. lVI-D.2l and Table Hxl 
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The most important limitation is its resolution; too closely 
spaced harmonics cannot be distinguished if the spacing is 
smaller than the inverse of the observation period. 

To overcome this resolution problem, parametric methods 
are devised. Assuming a statistical model with some unknown 
parameters, we can get more resolution by estimating the 
parameters from the data at the cost of more computational 
complexity. Theoretically, in parametric methods we can re- 
solve two closely spaced harmonics with limited data length 
if the SNR goes to infinit}|3. 

In this section, we shall discuss three parametric approaches 
for spectral estimation: the Pisarenko, the Prony, and the 
well known MUSIC algorithms. The first two are mainly 
used in spectral estimation, while the MUSIC method that 
was first developed for array processing has been extended 
to spectral estimation (see the next section). It should be 
noted that the parametric methods unlike the non-parametric 
approaches, require prior knowledge of the model order (the 
number of tones). This can be decided from the data using the 
Minimum Discription Length (MDL) method discussed in the 
next section. 

A. Prony Method 

The Prony method was originally proposed for modeling 
the expansion of gases [122]; however, now it is known as 
a general spectral estimation method. In fact, Prony tried to 
fit a weighted mixture of k damped complex exponentials to 
2k data measurements. The original approach is related to 
the noiseless measurements; however, it has been extended to 
produce the least squared solutions for noisy measurements. 
We focus only on the noiseless case here. The signal is 
modeled as a weighted mixture of k complex exponentials 
with complex amplitudes and frequencies: 



(33) 



where Xr is the noiseless discrete sparse signal consisting of 
k exponentials with parameters 

bi aiC^^' 

Zi = 6^^""^^^^ (34) 

where , 6>i , fi represent the amplitude, phase and the fre- 
quency (fi is a complex number in general), respectively. Let 
us define the polynomial H{z) such that its roots represent 
the complex exponential functions related to the sparse tones 
(see section ITFCl on FRI, (l26l) on LLP and Appendix HB: 

k k 

H{z) = l[{z-Zi)=J2hiz''-' (35) 

i=l i=0 

By shifting the index of (|33]) and multiplying by the parameter 
hj and summing over j we get: 

k k k 

hjXr-j = biZ^-^ Y ^3^^i~^ = (36) 

j=0 i=l j=0 

Similar to array processing to be discussed in the next section, we can 
resolve any two closely spaced sources conditioned on 1) limited snapshots 
and infinite SNR or 2) limited SNR and infinite number of observations, while 
the spatial aperture of the array is kept finite. 



TABLE V 
Basic Prony Algorithm 



3) 



Solve the recursive equation in (|36) to evaluate hi's. 
Find the roots of the polynomial represented in (|35); 
these roots are the complex exponentials defined as Zi 
in (ED. 

Solve f33l to obtain the amplitudes of the exponentials 



where r is indexed in the range k -\- 1 < r < 2k. This 
formula implies a recursive equation to solve for hi [9] . After 
the evaluation of hi's, the roots of (1351) yield the frequency 
components. Hence the amplitudes of the exponentials can be 
evaluated from a set of linear equations given in (|33]) . The 
basic Prony algorithm is given in Table (V] 

The Prony method is sensitive to noise, which was also 
observed in the LLP and the annihilating filter methods 
discussed in sections ITlI- Al and IlI-CI There are extended Prony 
methods that are better suited for noisy measurements. 

B. Pisarenko Harmonic Decomposition (PHD) 

The PHD method is based on the polynomial of the 
Prony method that utilizes the eigen-decomposition of the 
data covariance matrix [11]. Assume k complex tones are 
present in the spectrum of the signal. Then, decompose the 
covariance matrix of /c + 1 dimensions into a /c-dimensional 
signal subspace and a 1 -dimensional noise subspace that are 
orthogonal to each other. The noise subspace is spanned by 
the eigenvector v which satisfies Rv = a^v. 

By including the additive noise, the observations are given 
by: 



(37) 



where y is the observation sample and is a zero-mean noise 
term that satisfies E{vr^r+i} = cr'^S[i]. By replacing Xr = 
i/r — fr in the difference equation (l36l) , we get 

k 

hiyr-i = y^hi^r-i (38) 



k 
i=0 



=0 



which reveals the Auto-Regressive Moving Average (ARMA) 
structure (order {k^k)) of the observations yr as a random 
process. To benefit from the tools in linear algebra, let us 
define the following vectors 



y = [vr, yr-k\ 
h = [1, /ii, hk]^ 



[Ur-, . . . , ^r-k\ 



Now 



can be written as 



(39) 



(40) 



Multiplying both sides of (l40t by y and taking the expected 
value, we get £'{yy^}h = E{yv^}\i. Note that 



^{yy^} = R, 



^yy(O) 

Ryy{k) 



RyyiO) 



(41) 
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TABLE VI 
PHD Algorithm 



1) Given the model order k (number of sinusoids), find 
the autocorrelation matrix of the noisy observations 
with dimension k -\- 1 (Ryy). 

2) Find the smallest eigenvalue (cr^) of Ryy and the 
corresponding eigenvector (h). 

3) Set the elements of the obtained vector as the coef- 
ficients of the polynomial in (35}. The roots of this 
polynomial are the estimated frequencies. 



^{yi/^} = E{(x + = 
We thus have an eigen-equation 



Ryyh = a h 



(42) 



(43) 



which is the key equation of the Pisarenko method. The eigen- 
equation of (l43l) states that the elements of the eigenvector of 
the covariance matrix, corresponding to the smallest eigen- 
value (cr^), are the same as the coefficients in the recursive 
equation of x[r] (coefficients of the ARM A model in ([38])). 
Therefore, by evaluating the roots of the polynomial with 
coefficients that are the elements of this vector, we can find 
the tones in the spectrum. 

Although we started by eigen-decomposition of Ryy, we 
observed that only one of the eigenvectors is required; the one 
that corresponds to the smallest eigenvalue. This eigenvector 
can be found using simple approaches (in contrast to eigen- 
decomposition). The PHD method is briefly shown in Table 

lYD 

A different formulation of the PHD method with linear 
programming approach (refer to Sec. IVI-D.2I for description 
of linear programming) for array processing is studied in 
[123]. The PHD method is shown to be equivalent to a 
geometrical projection problem which can be solved using £i- 
norm optimization. Let us convert the autocorrelation matrix 
Rmxm into an X 1 vector. If Ri and R2 are the spa- 
tial autocorrelation matrices of two uncorrected sources, the 
overall autocorrelation matrix is Ri + R2 when both sources 
are active, which is translated to a similar summation of the 
corresponding vectors. Although it seems that the vectorized 
notation generates a subspace for uncorrelated sources, only 
linear combinations of x 1 vectors with positive coefficients 
are acceptable (resulting in a hyper-cone) since R is positive 
definite. When the number of sources is less than the number 
of sensors, the respective vector is restricted to lie on the 
surface of the cone. Due to the existence of noise, the observed 
vector falls within the cone. It is shown [123] that the PHD 
method projects the measured vectors into the surface. For 
the purpose of the projection, £i-norm optimization can be 
employed instead of the common Pisarenko method. 

C. MUSIC 

Multiple Signal Classification (MUSIC), is a method orig- 
inally devised for high resolution source direction estimation 
in the context of array processing that will be discussed in 



the next section [124]. The inherent equivalence of the array 
processing and time series analysis paves the way for the 
employment of this method in spectral estimation. MUSIC 
can be understood as a generalization and improvement of the 
Pisarenko method. It is known that in the context of array 
processing, MUSIC attains the statistical efficienc}0 in the 
limit of asymptotically large number of observations [12]. This 
is a valuable property since MUSIC contains only a 1-D search 
while efficient ML methods require /c-dimensional searches 

MUSIC estimates the frequencies by finding k local maxima 
of a 1-D spectrum function while ML exhaustively searches 
a /c-dimensional parameter space to find the global maximum 
of a /c-variable spectrum function. 

In the PHD method, we construct an autocorrelation matrix 
of dimension k -\- 1 under the assumption that its smallest 
eigenvalue (a^) belongs to the noise subspace. Then we use 
the Hermitian property of the covariance matrix to conclude 
that the noise eigenvector should be orthogonal to the signal 
eigenvectors. In MUSIC, we extend this method, using a 
noise subspace of dimension greater than one to improve the 
performance. We also use some kind of averaging over noise 
eigenvectors to obtain a more reliable signal estimator. 

The data model for the sum of exponentials plus noise can 
be written in the matrix form as 



y = Ab + 

where the length of data is taken sls m > k and 



(44) 



1 

j{m-l)uji 



1 

pj(m-l)a;fc 



(45) 

where ly represents the noise vector. Since the frequencies 
are different, A is of rank k and the first term in (l44l) forms 
a /c-dimensional signal subspace, while the second term is 
randomly distributed in both signal and noise subspaces; i.e., 
unlike the first term, it is not confined to a subspace of lower 
dimension. The correlation matrix of the observations is given 
by 



b 4 


[61, .. 




A 

1/ = 




iT 


A 

y = 







R = Abb^A 



(46) 



where the noise is assumed to be white with variance a^. 
If we decompose R into its eigenvectors, k eigenvalues 
corresponding to the /c-dimensional subspace of the first term 
of (l46l) are essentially greater than the remaining m — k 
values, cr^, corresponding to the noise subspace; thus, by 
sorting the eigenvalues, the noise and signal subspaces can 
be determined. Assume co is an arbitrary frequency and 



e(cj) = [1, 



, e^(^-i)^]. The MUSIC method estimates 



^^Statistical efficiency of an estimator means that it is asymptotically 
unbiased and its variance goes to zero. 

^"^The error function is non-convex and has multiple local minima; thus 
gradient descent methods cannot be used to achieve a global minimum. 
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Fig. 18. Spectral estimation of a sparse mixture of sinusoids using Prony, 
Pisarenko and MUSIC methods; input SNR is 5dB and 1024 time samples 
are used. 



the spectrum content of the signal at frequency uj by projecting 
the vector e(cc;) into the noise subspace. When the projected 
vector is zero, the vector e(cj) falls in the signal subspace 
and most likely, uj is among the spectral tones. In fact, the 
frequency content of the spectrum is inversely proportional to 
the £2 -norm of the projected vector: 

PmU {00) = _ur.^lrl.r.^ (47) 



e^(cj)n^e(cj) 



;=/c+i 



(48) 



where v^'s are eigenvectors of R corresponding to the noise 
subspace. 

The k peaks of Pmu{^) are selected as the frequencies 
of the sparse signal. The determination of the number of 
frequencies (model order) in MUSIC is based on the MDL and 
Akaike Information Criterion (AIC) methods to be discussed 
in the next section. 

Figure [18] shows results for various spectral line estimation 
methods discussed above. The first upper figure shows the 
original spectral lines, and the three other figures belong to 
the results of the estimation methods. We can see that the 
Prony method (which is similar to ELP and annihilating filter 
of Sec. III-CI and (l26l) ) does not yield good results, while the 
PHD method is better. 

V. Sparse Array Processing 

We address three types of array processing: 1- Estimation of 
Multi-Source Location (MSL) and Direction of Arrival (DOA), 
2- Sparse Array Beam-forming and Design, and 3- Sparse 
Sensor Networks. The first topic is related to estimating the 
direction and/or the location of multiple targets; this problem 
is very similar to the problem of spectral estimation dealt 
with in the previous section. The second topic is related to 
the design of sparse arrays with some missing and/or random 
array sensors. The last topic, depending on the type of sparsity, 
is either similar to the second topic or related to CS of sparse 



/ / / 
/ / / 
/ / / 

/ / / / 
/ / / / 
/ / / / 
/ / / / 


Incoming wave 
////// 
////// 
////// 

^ / / / / / 

?/ / / / 

^\ /N / / / 


I 





Fig. 19. Uniform linear array with element distance d, element length /, 
and a wave arriving from direction 



signal fields in a network. Below each topic will be briefly 
described. 

A. Array Processing for MSL and DOA Estimation 

Among the important fields of active research in array 
processing are MSL and DOA estimation [124], [125], [126]. 
In such schemes, a passive or active array of sensors is used to 
locate the sources of narrow-band signals. Some applications 
may assume far-field sources (e.g. radar signal processing) 
where the array is only capable of DOA estimation, while 
other applications (e.g. biomedical imaging systems) assume 
near-field sources where the array is capable of locating 
the sources of radiation. A closely related field of study is 
spectral estimation due to similar linear statistical models. The 
stochastic sparse signals pass through a partially known linear 
transform (e.g., array response or inverse Fourier transform) 
and are observed in a noisy environment. 

In the array processing context, the common temporal 
frequency of the source signals are known. Spatial sampling 
of the signal is used to extract direction of the signal (spatial 
frequency). As a far-field approximation, the signal wavefronts 
are assumed to be planar. Consider a signal arriving with angle 
Lp as in Fig. [191 Simultaneous sampling of this wavefront 
on the array will exhibit a phase change of the signal from 
sensor to sensor. In this way, discrete samples of a complex 
exponential are obtained, where its frequency can be translated 
to the direction of the signal source. The response of a Uniform 
Linear Array (ULA) to a wavefront impinging on the array 
from direction Lp is 



a(v^) = [1, e 



j27r4 sin((^) 



j(n-l)27rf sin( 



(49) 



where d is the inter-element spacing of the array, A is the 
wavelength, and n is the number of sensors in the array. When 
multiple sources are present, the observed vector is the sum 
of the response(sweep) vectors and noise. This resembles the 
spectral estimation problem with the difference that sampling 
of the array elements is not limited in time. In fact, in array 
processing an additional degree of freedom (the number of 
elements) is present; thus, array processing is more general 
than spectral estimation. 

Two main fields in array processing are MSL and DOA for 
estimating the source locations and directions, respectively; 
for both purposes, the angle of arrival (azimuth and elevation) 
should be estimated while for MSL an extra parameter of range 
is also needed. The simplest case is the 1-D ULA (azimuth- 
only) for DOA estimation. 
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For the general case of k sources with angles cpi^ . . . ^ipk 
with respect to the array, the ULA response is given by the 
matrix A{(f) = [a((/9i), . . . , a((/?/c)], where the vector (f of 
DOA's is defined 2iS cp = [(pi^ . . . , (pk] - In the above notation, 
A is a matrix of size n x k and a{(piys are column vectors. 
Now, the vector of observations at array elements (x[z]) is 
given by 

x[z] = As\i]^iy\i] (50) 

where the vector s[i] represents the multi-source signals and 
is the white Gaussian noise. Source signals and additive 
noise are assumed to be zero-mean and i.i.d. normal processes 
with covariance matrices P and cr^I, respectively. With these 
assumptions, the observation vector x[z] will also follow an 
n-dimensional zero-mean normal distribution with the covari- 
ance matrix 

R = £;{xx^} = APA^ + cr^I (51) 

In the field of DOA estimation, extensive research has been ac- 
complished in 1) source enumeration, and 2) DOA estimation 
methods. Both of the subjects correspond to the determination 
of parameters k and (p. 

Although some methods are proposed for simultaneous 
detection and estimation of the model statistical character- 
istics [127], most of the literature is devoted to two-stage 
approaches; first, the number of active sources is detected 
and then their directions are estimated by techniques such 
as Estimation of Signal Parameters via Rotational Invariance 
Techniques (ESPRIxfl [128], [129], [130], [131], [132]. 
Usually the joint detection-estimation methods outperform the 
two-stage approaches with the cost of higher computational 
complexity. Source enumeration techniques often rely on the 
eigenvalues of the covariance matrix obtained from the re- 
ceived data. These eigenvalues are indicators of the energy 
level in certain principal components of the data. In the MSL 
estimation step, similar to the MUSIC method, the signal 
is decomposed into two orthogonal subspaces of signal and 
noise. The signal subspace corresponds to the column space 
of the sweep matrix A; thus, proper estimation of the signal 
subspace leads to estimation of the matrix A and consequently 
cp and range [12], [13]. In the sequel, we take a closer look 
at these two steps. 

1) Minimum Description Length (MDL): One of the most 
successful methods in array processing for source enumeration 
is the use of MDL criterion [133]. This technique is very 
powerful and outperforms its older versions including AIC 
[134], [135], [136]. Hence, we confine our discussion to MDL 
algorithms. 

Preliminaries: MDL is an optimum method of finding the 
model order and parameters for the most compressed repre- 
sentation of the observed data. For the purpose of statistical 
modeling, the MAP probability or, the suboptimal criterion 

^^The array in ESPRIT is composed of sensor doublets with the same 
displacement. The parameters of the impinging signals can be estimated via 
a rotational invariant property of the signal subspace. The complexity and 
storage of ESPRIT is less than MUSIC; it is also less vulnerable to array 
imperfections. ESPRIT, unlike MUSIC results in an unbiased DOA estimate; 
nontheless, MUSIC outperforms ESPRIT, in general. 



of ML is used; more precisely, conditioned on the observed 
data, the maximum probability among the possible options is 
found (hypotheses testing) [137]. When the model parameters 
are not known, the MAP and ML criteria result in the most 
complex approach; consider fitting a finite sequence of data to 
a polynomial of unknown degree [37]: 

x{ti) = P{ti) + u{ti), i = l, . . . , m (52) 

where P{t) = ao + ait + • • • + akt^, u{t) is the observed 
Gaussian noise and k is the unknown model order (degree 
of the polynomial P(t)) which determines the complexity. 
Clearly, m — 1 is the maximum required order for unique 
description of the data (m observed samples) and the ML 
criterion, always selects this maximum value (kuL = m — 1); 
i.e., the ML method forces the polynomial P{t) to pass 
through all the points. MDL, on the other hand, yields a sparser 
solution (kMDL < m — 1). 

Due to the existence of additive noise, it is quite rational 
to look for a polynomial with degree less than m which also 
takes the complexity order into account. In MDL, the idea 
of how to consider the complexity order is borrowed from 
information theory: Given a specific statistical distribution, we 
can find an optimum source coding scheme (e.g., Huffman 
coding) which attains the lowest average code length for the 
symbols. Furthermore, if Px is the distribution of the source 
X and Qx is another distribution, we have [138]: 

H{x) = Ep^{-\ogpx) < Ep^{-\ogqx) (53) 

where H{x) is the entropy of the signal. This implies that 
the minimum average code length is obtained only for the 
correct source distribution (model parameters); in other words, 
the choice of wrong model parameters (distribution function) 
leads to larger code lengths. When a particular model with 
the set of parameters 6 is assumed for the data a priori, each 
time a sequence x is received, the parameters first should 
be estimated. The optimum estimation method is usually the 
ML estimator which results in 6ml- Now, the probability 
distribution for a received sequence x becomes p{x\6ml) 
which according to information theory, requires an average 
code length of — log {p{x\6ml{x))) bits. In addition to the 
data, the model parameters should also be encoded which 
in turn require | log(m) bits where k, is the number of 
independent parameters to be encoded in the model and m is 
the number of data point^. Thus, the two part MDL selects 
the model that minimizes the whole required code length 
which is given by [139]: 

- log {p{x\Oml)) + ^ log(m) (54) 

The first term is the ML term for data encoding and the second 
term is a penalty function that inhibits the number of free 
parameters of the model to get very large. 

MDL Source Enumeration: In the source enumeration prob- 
lem, our model is a multivariate Gaussian random process 
with zero mean and covariance of the type shown in (ISTI) . 
where the number of active sources is unknown. In some 

^^For a video introduction to these concepts, please refer to 
http://videolectures.net/icml08.grunwaldjndl 
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enumeration methods (other than MDL), the exact form of dSTI) 
is employed which results in high computational complexity. 
In the conventional MDL method, it is assumed that the 
model is a covariance matrix with a spherical subspacefZl of 
dimension n — k. Suppose the sample covariance matrix is 



R 



-j^ m 

m ^-^ 



(55) 



and assume the ordered eigenvalues of R are Ai > A2 > • • • > 
An, while the ordered eigenvalues of the exact covariance 
matrix R are Ai > • • • > A^ > A/e+i = • • • = A^ = cr^. 
The normal distribution function of the received complex data 
X is [129] 

p(x;R)= . ,/^.^ e-^-i^-^^> (56) 



where tr(.) stands for the trace operator. The ML estimate 
of signal eigenvalues in R are A^, z = 1, . . . , /c with 
the respective eigenvectors {v^}^^^. Since A^+i = ••• = 
An = cr^, the ML estimate of the noise eigenvalue is = 
^EiL/c+i^^ {^JiLfe+i ^1"^ ^ ^oise eigenvectors. 
Thus, the ML estimate of R given R is 



R 



ML 



5: A, 



v,;v; 



H 



^2 



E 

;=/c+i 



V,;V, 



H 



(57) 



In fact, since we know that R has a spherical subspace of 
dimension n — k,^Q correct the observed R to obtain Rml- 
Now, we calculate — log (p(x|Rml)); from Appendix Hvl 
we have: 



(58) 



which is independent of k and can be omitted in the minimiza- 
tion of (l54l) . Thus, for the first term of (l54l) we only need the 
determinant |Rml| which is the product of the eigenvalues, 
and the MDL criterion becomes (see Appendix II Vb : 
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Fig. 20. An example of the performance of the MDL criterion in array 
processing. The MDL estimates the number of active sources, which is 2. 
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(59) 



where n is the number of free parameters in the distribution. 
This expression should be computed for different values of 
< /c < n — 1 and its minimum point should be kMDL- 
Note that we can subtract the term m Yll^=i log(Ai) from the 
expression, which is not dependent on k to get the well known 
MDL criterion [129] (also see Appendix Hvl) : 



m{n — k) log 



1 sr^n 
n—k ^i- 
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where the first term is the likelihood ratio for the spheric- 
ity test of the covariance matrix. This likelihood ratio is 
a function of arithmetic and geometric means of the noise 
subspace eigenvalues [140]. Figure [2Ql is an example of MDL 

Spherical subspace implies the eigenvalues of the cross-correlation matrix 
are equal in that subspace. 



performance in determining the number of sources in array 
processing. It is evident that in low SNR's, the MDL has a 
strong tendency to underestimate the number of sources, while 
as SNR increases, it gives a consistent estimate. Also at high 
SNR's, underestimation is more probable than overestimation. 

Now we compute the number of independent parameters 
(k,) in the model. Since the noise subspace is spherical, 
the choice of eigenvectors in this subspace can accept any 
arbitrary orthonormal set; i.e., no information is revealed 
when these vectors are known. Thus, the set of parameters 



(60) is {Ai, 



Afe,cr"', Vi, 



v/e}. The eigenvalues of a 



hermitian matrix (correlation matrix) are all real while the 
eigenvectors are normal complex vectors. Therefore, the eigen- 
values (including a^) introduce k-\-l degrees of freedom. The 
first eigenvector has 2n — 2 degrees of freedom (since its first 
nonzero element can be adjusted to unity); while the second, 
due to its orthogonality to the first eigenvector, has 2n — 4 
degrees of freedom. With the same argument, it can be shown 
that there are 2{n — i) free parameters in the i^^ eigenvector; 
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hence 

k 

= 1 + /c + ^ 2(n - z) = n(2n - fc) + 1 (61) 

i=l 

where the last integer 1 can be omitted since it is independent 
of A:. 

The two-part MDL, despite its very low computational 
complexity, is among the most successful methods for source 
enumeration in array processing. Nonetheless, this method 
does not reach the best attainable performance for finite 
number of measurements [141]. The new version of MDL, 
called one-part or Refined MDL has improved the performance 
for the cases of finite measurements which has not been 
applied to the array processing problem [37]. 

B. Sparse Array Beam-forming and Design 

Sparsely sampled irregular and random arrays have been 
used in several fields such as radar, sonar, ultrasound imaging, 
and seismology. The objective is to improve economy of 
design, e.g., achieve better resolution for a given maximum 
number of array elements. The whole idea of sparse arrays 
depends on having a feature which is better than needed, e.g., 
sidelobe suppression which can be traded for resolution. 

In array signal processing, the aperture smoothing function 
plays the same role as the transfer function of an LTI filter. 
Assume that n elements are spaced as a 1-D uniform linear 
array with a distance d and are located Xi = i • d for i = 
0, 1, n — l,as shown in Fig.[T9j The aperture smoothing 
function when each element is weighted by a scalar Wi is 

n-l 

W{u) = ^Wie-^''^'^^ (62) 

The variable u is defined hy u = sin (p where (p is called the 
azimuth angle, A the wavelength, and the weights, Wi, are a 
standard window function [14]. The weights, Wi, could also 
be angle-dependent if the individual elements are directional 
(dipoles). The aperture smoothing function determines how 
the wavefield Fourier transform is filtered by a finite aperture, 
just as the filter transfer function determines how the received 
signal spectrum is shaped by the filtering operation. The 
condition for avoiding aliasing is that the argument in the 
exponent satisfies 

\u\ 

27T^—^d= \f3:,\-d <7r (63) 
A 

where (3x = 27Tj is the x component of the wave-number. 
The relationship between the array pattern for a regular 1-D 
array and a filter transfer function is 

u 

uj ^ = 27r- 
T ^ d 

hi ^ Wi (64) 

By using these parallels, the time-frequency sampling theorem 
T < ^ translates into the spatial sampling theorem d < 
^^^^ . The aperture smoothing function at the Nyquist rate of 
a uniform linear array is depicted in Fig. [211 
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Fig. 21. The element response for an array with (d = A/2) and omni- 
directional elements (1 <C A). 

In addition, array processing has some additional degrees of 
freedom that are not so common in time-frequency processing, 
such as irregular sampling, and things that are not found at all 
such as elements along curved surfaces. 

A sparse array has elements removed from the aperture 
(thinning), and there are basically two approaches to finding 
the best thinning. The first is through optimization; This can 
be performed in 1-D and 2-D, and also with elements on 
an underlying regular grid or with freely chosen positions in 
the aperture (avoiding overlapping elements). It also makes 
a difference whether the array is flat or curved. The most 
popular optimization criterion is to minimize the peak sidelobe 
in the beampattern with a condition on the maximum mainlobe 
width. 

The second approach is more heuristic, and is based on the 
experience that it is often not possible to formulate optimality 
in a sense that is compatible with standard optimization 
algorithms. Also, in an imaging system, more degrees of 
freedom can be obtained in the optimization, if one allows 
the layouts of the transmitter and the receiver to be different. 
We will still assume that the receiver and transmitter arrays 
are located in the same position (monostatic), but now there 
is partial or no overlap between the selected elements. 

In the optimization approach, the challenge is the combina- 
torial problem of finding the best layout of sparse elements for 
one and two dimensions. The optimization problem is creation 
of beampatterns with low mainlobe width and low sidelobes. 
The layout optimization methods consist of LP, Genetic Al- 
gorithms (GA) and Simulated Annealing (SA) [142]. 

With a sparse optimized random array, we can get an array 
pattern as shown in Fig. [22l Comparing this figure to Fig. 
[2n we see that with even | of the elements, we can get 
comparable results for the peak sidelobe value. It is also 
evident that thinning has a price in that the overall energy 
in the sidelobe region has increased. 

Another example is shown in Fig. [23] where enhanced 
simulated annealing has been used. This approach can handle 
arbitrary layouts where the elements are not locked to a 
Cartesian grid and where each individual element can have 
different directivity. Also, it can easily be extended to handle 
wideband excitations and the response optimization in the near 
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Fig. 22. Array pattern for optimized array with 25 elements out of 101. 





(a) 



(b) 




(c) 



Fig. 23. Symmetric non-cartesian sparse arrays, (a) sparse hex-grid array 
thinned from 112 to 80 elements, (b) a square element array thinned from 
121 to 77 elements, and (c) a ring array thinned from 112 to 80 elements. 
Optimization is by simulated annealing [144]. 



field (focused arrays) [143]. 

The simple examples of Fig. [23] illustrate the geometries 
that the algorithm can handle and shows a sparse hex-grid 
optimized array with hexagonally shaped elements, a linear 
grid design with square elements, and a ring array having 
increasing element sizes for increasing ring radii. 

One of the optimization problems in array processing is 
to design the sparsest array for a given beam pattern [142], 
[145]. In this case, as shown in Table HIl in row 12, the 
random sparsity is in the space domain and the information 
domain is a desirable aperture array pattern; we conjecture 
that, for a given beampattern, the IMAT used for random 
sampling and impulsive noise removal (Section III- A. II) can 
be used in synthesizing the number, locations and weights 



Fig. 24. One-way array pattern after optimization for 64-element array 
randomly thinned to 48 elements. Thinning and weights are shown in Fig. 

ED 




Fig. 25. Weights found after optimization for beam-pattern of the random 
layout as shown in Fig. |24] 



of the sparse array elements. Another optimization issue for 
a given number of array elements is equivalent to designing 
the weights and locations of sparse random arrays to yield 
an optimum array /beam-pattern [146], see also Figs. [24l and 
[25l this kind of optimization is unique for this application 
and has no parallel in other applications. The weights and the 
array locations can be optimized separately or jointly; joint 
optimization of thinning pattern and weights has been reported 
in sonar arrays. Joint optimization of positions and weights is 
also possible, usually by iterating over a sequence of position 
optimization followed by weight optimization. Optimization 
of the element positions of a sparse array is considerably 
more difficult than weight optimization. The reason is that 
for an array with n elements, the number of combinations for 
selecting k array elements is 



(65) 



k\{n-k)l 



When n, k are large or for 2-D arrays, an exhaustive search 
is out of the question. There are several ways that this number 
can be reduced. The optimization criteria are the same for the 
layout problem as for the weight problem, i.e., 

• Minimize main sidelobe of the pattern while restricting 
the width of the main-lobe below a given limit. 
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• Minimize the total energy of the pattern wasted in the 
sidelobes while keeping the peak value of the main lobe 
above a given threshold. 

The second heuristic approach builds on the properties of 
some of the basic building blocks of sparse arrays: random 
arrays, binned arrays, and periodic arrays. 

Assume an array with n elements where only k elements are 
kept after random thinning. The average array pattern is equal 
to that of an aperture weighted with the probability density 
function of the thinning. This also determines the shape of the 
main lobe. For a uniform thinning, it turns out that the ratio of 
the average sidelobe power to the main lobe power is p But 
the variance is about k for \u\ = \ sm{(p)\ > ^, where L is 
the aperture. The relative peak level of a 1-D random array is 
V^ln(A:) [147] which gives, in our experience, a fairly good 
estimate of the peak level. 

In the binned array, the aperture is divided into k equal 
size bins and one element is chosen at random in each bin. 
This resembles a nearest neighbor restriction as there can be 
no more than two neighbor elements in a 1-D binned array. 
With a uniform distribution in each bin, it can be shown that 
the binned array has the same properties as the random array 
except that the variance does not reach the value of k until the 
angle reaches = | sin((/?)| > k^. Thus the binned array has 
random sidelobes close to the mainlobe that are much lower 
than the random array [147]. 

Periodic arrays are thinned with a periodicity, e.g., 
1001001001, 101010101, or 11001100. This means that grat- 
ing lobes are formed whose position and size are easily 
predicted. Periodic arrays have turned out to be particularly 
useful in imaging systems where one can use different peri- 
odicities for the transmitter and the receiver. As the two-way 
beampattern is the product of the transmitter and the receiver 
beampatterns, by proper design, grating lobes formed by the 
transmitter can be suppressed by the receiver and vice versa. 
A special case is the Vernier array which has periodicities p 
andp - 1 [148]. 

The simplest way to utilize the above properties is to use the 
properties of periodic arrays and variants where grating lobes 
in the transmitter are used to cancel the receiver grating lobes 
and vice versa. References [149], [150] describe simulations 
and experiments with a 2-D array for medical ultrasound of 
size 48 X 48 elements where several variations and combi- 
nations are utilized for minimizing sidelobes. The different 
variants investigated with partial overlap between transmitter 
and receiver elements are 

• Periodic 2-D arrays with different periodicity on trans- 
mission and reception 

• Arrays with diagonal periodicity combined with each 
other and with periodic arrays, see for example Fig. [26] 
[149]. It gives a two-way maximum sidelobe of —49 dB 
when un-steered and —40 dB when steered to (30, 30) 
degrees. 

• Arrays with periodicity along only a single axis and 
combinations with diagonal periodic arrays 

• Arrays with periodicity along radius vectors rather than 
the X- and ?/-axes 
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Fig. 26. (a) Periodic diagonal array with 877 out of 2304 elements used for 
transmission and (b) periodic array with 208 elements for reception- [149]. 



Arrays which have no overlap between the transmitter and 
receiver elements are 

• Binned arrays with different layouts for transmission and 
reception 

• Polar binned arrays where the bins are along rays ema- 
nating from the center of the array 

The proposed configurations have been tested on real data 
for an array that has been built. Good correspondence with 
simulations have been obtained. 

Curved Arrays: More recently, there has been some re- 
search on optimizations of curved arrays. The beam pattern 
can be found by projecting the elements onto a flat surface 
(Fourier projection- slice theorem), therefore, the beampattern 
is equivalent to that of an array with unequal spacing between 
the elements. This reduces grating lobes. There is an optimal 
radius of the curvature for the array, which minimizes the 
peak sidelobe in the two-way response [151]. The optimal 
value varies with the thinning method. Similar work for the 
optimization of one-way response has also been applied to 
cylindrical arrays for sonar applications [152]. 

C. Sparse Sensor Networks 

Wireless sensor networks typically consist of a large number 
of sensor nodes, spatially distributed over a region of interest, 
that observe some physical environment including acoustic, 
seismic, and thermal fields with applications in a wide range of 
areas such as health care, geographical monitoring, homeland 
security, and hazard detection. The way sensor networks are 
used in practical applications can be divided into two general 
categories: 

1) There exists a central node known as the Fusion Cen- 
ter (FC) that retrieves relevant field information from 
the sensor nodes and communication from the sensor 
nodes to FC generally takes place over a power- and 
bandwidth-constrained wireless channel. 

2) Such a central node does not exist and the nodes 
take specific decisions based on the information they 
obtain and exchange among themselves. Issues such 
as distributed computing and processing are of high 
importance in such scenarios. 

In general, there are three main tasks that should be im- 
plemented efficiently in a wireless sensor network: sensing. 
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communication, and processing. The main challenge in design 
of practical sensor networks is to find an efficient way of 
jointly performing these tasks, while using the minimum 
amount of system resources (computation, power, bandwidth) 
and satisfying the required system design parameters (such 
as distortion levels). For example, one such metric is the 
so-called energy-distortion tradeoff which determines how 
much energy the sensor network consume in extracting and 
delivering relevant information up to a given distortion level. 
Although many theoretical results are already available in the 
case of point-to-point links in which separation between source 
and channel coding can be assumed, the problem of efficiently 
transmitting or sharing information among a vast number of 
distributed nodes remains a great challenge. This is due to 
the fact that well-developed theories and tools for distributed 
signal processing, communications, and information theory in 
large-scale networked systems are still under development. 
However, recent results on distributed estimation or detection 
indicate that joint optimization through some form of source- 
channel matching and local node cooperation can result in 
significant system performance improvement [153], [154], 
[155], [156], [157]. 

1) How sparsity can be exploited in a sensor network: 
Sparsity appears in many applications for which sensor net- 
works are deployed, e.g., localization of targets in a large 
region or estimation of physical phenomena such as tempera- 
ture fields that are sparse under a suitable transformation. For 
example, in radar applications, under a far-field assumption, 
the observation system is linear and can be expressed as a 
matrix of steering vectors [158], [159]. In general, sparsity 
can arise in a sensor network from two main perspectives: 

1) Sparsity of node distribution in spatial terms 

2) Sparsity of the field to be estimated 

Although nodes in a sensor network can be assumed to be 
regularly deployed in a given environment, such an assumption 
is not valid in many practical scenarios. Therefore, the non- 
uniform distribution of nodes can lead to some type of sparsity 
in spatial domain that can be exploited to reduce the amount 
of sensing, processing, and/or communication. This issue is 
subsequently related to extensions of the nonuniform sampling 
techniques to two-dimensional domains through proper inter- 
polation and data recovery when samples are spatially sparse 
[38], [160]. The second scenario that provides a proper basis 
for exploiting the sparsity concepts arises when the field to be 
estimated is a sparse multi-dimensional signal. From this point 
of view, ideas such as those presented earlier in the context of 
compressed sensing (Sec. III-Bb provide the proper framework 
to address the sparsity in such fields. 

Spatial Sparsity and Interpolation in Sensor Networks: 
Although general two-dimensional interpolation techniques 
are well-known in various branches of statistics and signal 
processing, the main issue in a sensor network is exploring 
proper spatio/temporal interpolation such that communication 
and processing are also efficiently accomplished. While there 
is a wide range of interpolation schemes (polynomial, Fourier, 
and least squares [161]), many of these schemes are not 
directly applicable for spatial interpolation in sensor networks 
due to their communication complexity. 



Another characteristic of many sensor networks is the non- 
uniformity of node distribution in the measurement field. 
Although non-uniformity has been dealt with extensively in 
contexts such as signal processing, geo-spatial data processing, 
and computational geometry [2], the combination of irregular 
sensor data sampling and intra-network processing is a main 
challenge in sensor networks. For example, reference [162] 
addresses the issue of spatio-temporal non-uniformity in sen- 
sor networks and how it impacts performance aspects of a 
sensor network such as compression efficiency and routing 
overhead. In order to reduce the impact of non-uniformity, 
the authors in [162] propose using a combination of spatial 
data interpolation and temporal signal segmentation. A simple 
interpolation wavelet transform for irregular sampling which 
is an extension of the 2-D irregular grid transform to 3-D 
spatio-temporal transform grids is also proposed in [163]. 
Such a multi-scale transform extends the approach in [164] 
and removes the dependence on building a distributed mesh 
within the network. It should be noted that although wavelet 
compression allows the network to trade reconstruction quality 
for communication energy and bandwidth usage, such energy 
savings are naturally offset by the overhead cost of computing 
the wavelet coefficients. 

Distributed wavelet processing within sensor networks is 
yet another approach to reduce communication energy and 
wireless bandwidth usage. Use of such distributed processing 
makes it possible to trade long-haul transmission of raw data 
to the FC for less costly local communication and processing 
among neighboring nodes [163]. In addition, local collabora- 
tion among nodes decorrelates measurements and results in a 
sparser data set. 

Compressive Sensing in Sensor Networks: Most natural 
phenomena in SN's are compressible through representation 
in a natural basis [79]. Some examples of these applications 
are imaging in a scattering medium [158], MIMO radar [159], 
and geo-exploration via underground seismic data. In such 
cases, it is possible to construct a highly compressed version 
of a given field, in a decentralized fashion. If the correlations 
between data at different nodes are known a-priori, it is 
possible to use schemes that have very favorable power- 
distortion-latency tradeoffs ([153], [165], [166]). In such cases, 
distributed source coding techniques, such as Slepian-Wolf 
coding, can be used to design compression schemes without 
collaboration between nodes (see [165] and the references 
therein). Since prior knowledge of such correlations is not 
available in many applications, collaborative, intra-network 
processing and compression are used to determine unknown 
correlations and dependencies through information exchange 
between network nodes. In this regard, the concept of com- 
pressive wireless sensing has been introduced in [157] for 
energy-efficient estimation at the FC of sensor data, based on 
ideas from wireless communications [153], [155], [166], [167], 
[168] and compressive sampling theory [30], [84], [169]. The 
main objective in such an approach is to combine processing 
and communications in a single distributed operation [170], 
[171], [172]. 

Methods to obtain the required sparsity in a SN: While 
transform-based compression is well-developed in traditional 



24 



signal and image processing domains, the understanding of 
sparse transforms for networked data is not as trivial [173]. 
There are methods such as associating a graph with a given 
network, where the vertices of the graph represent the nodes 
of the network, and edges between vertices represent rela- 
tionships among data at adjacent nodes. The structure of the 
connectivity is the key to obtaining effective sparse transfor- 
mations for networked data [173]. For example, in the case of 
uniformly distributed nodes, tools such as DFT or DCT can 
be adopted to exploit the sparsity in the frequency domain. In 
more general settings, wavelet techniques can be extended to 
handle the irregular distribution of sampling locations [163]. 
There are also scenarios in which standard signal transforms 
may not be directly applicable. For example, network monitor- 
ing applications rely on the analysis of communication traffic 
levels at the network nodes where network topology affects the 
nature of node relationships in complex ways. Graph wavelets 
[174] and diffusion wavelets [175] are two classes of trans- 
forms that have been proposed to address such complexities. 
In the former case, the wavelet coefficients are obtained by 
computing the digital differences of the data at different scales. 
The coefficients at the first scale are differences between 
neighboring data points, and those at subsequent spatial scales 
are computed by first aggregating data in neighborhoods and 
then computing differences between neighboring aggregations. 
The resulting graph wavelet coefficients are then defined by 
aggregated data at different scales, and computing differences 
between the aggregated data [174]. In the latter scheme, 
diffusion wavelets are based on construction of an orthonormal 
basis for functions supported on a graph and obtaining a 
custom-designed basis by analyzing eigenvectors of a diffusion 
matrix derived from the graph adjacency matrix. The resulting 
basis vectors are generally localized to neighborhoods of 
varying size and may also lead to sparse representations of 
data on a graph [175]. One example of such an approach is 
where the node data correspond to traffic rates of routers in a 
computer network. 

Implementation of CS in a wireless SN: Two main ap- 
proaches to implement random projections in a SN are dis- 
cussed in the literature [173]. In the first approach, the CS pro- 
jections are simultaneously calculated through superposition 
of radio waves and communicated using amplitude-modulated 
coherent transmissions of randomly-weighted values directly 
from the nodes in the network to the FC (Fig. [27l) . This 
scheme, introduced in [157], [167] and further refined in [176], 
is based on the notion of so-called matched source-channel 
communication [166], [167]. Although the need for complex 
routing, intra-network communications, and processing are 
alleviated, local phase synchronization among nodes is an 
issue to be addressed properly in this approach. 

In the second approach, the projections can be computed 
and delivered to every subset of nodes in the network using 
gossip/consensus techniques, or be delivered to a single point 
using clustering and aggregation. This approach is typically 
used for networked data storage and retrieval applications. In 
this method, computation and distribution of each CS sample is 
accomplished through two simple steps [173]. In the first step, 
each of the sensors multiplies its data with the corresponding 
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Fig. 27. Computation of CS projections through superposition of radio waves 
of randomly weighted values directly from the nodes in the network to the 
FC (from [173]). 



element of the compressing matrix. Then, in the second 
step, the resulting local terms are simultaneously aggregated 
and distributed across the network using randomized gossip 
[177], which is a simple iterative decentralized algorithm for 
computing linear functions. Because each node only exchanges 
information with its immediate neighbors in the network, 
gossip algorithms are more robust to failures or changes in 
the network topology and cannot be easily compromised by 
eliminating a single server or fusion center [178]. 

Finally, it should be noted that in addition to the encod- 
ing process, the overall system performance is significantly 
affected by the decoding process [82], [179], [180]; this study 
and its extensions to sparse SN's remain as challenging tasks. 

2) Sensing Capacity: Despite wide-spread development of 
SN ideas in recent years, understanding of fundamental perfor- 
mance limits of sensing and communication between sensors 
is still under development. One of the issues that has recently 
attracted attention in theoretical analysis of sensor networks, 
is the concept of sensor capacity. The sensing capacity was 
initially introduced for discrete alphabets in applications such 
as target detection [181], and later extended in [15], [182], 
[183] to the continuous case. The questions in this area are 
related to the problem of sampling of sparse signals, [30], [69], 
[169] and sampling with finite rate of innovation [4], [95]. 
In the context of the CS, sensing capacity provides bounds 
on the maximum signal dimension or complexity per sensor 
measurement that can be recovered to a pre-defined degree of 
accuracy. Alternatively, it can be interpreted as the minimum 
number of sensors necessary to monitor a given region to a 
desired degree of fidelity based on noisy sensor measurements. 
The inverse of sensing capacity is the compression rate; i.e., 
the ratio of the number of measurements to the number of 
signal dimensions which characterizes the minimum rate to 
which the source can be compressed. As shown in [15], sens- 
ing capacity is a function of SNR, the inherent dimensionality 
of the information space, sensing diversity, and the desired 
distortion level. 

Another issue to be noted with respect to the sensing capac- 
ity is the inherent difference between sensor network and CS 
scenarios in the way in which the SNR is handled [15], [183]. 
In sensor networks composed of many sensors, fixed SNR 
can be imposed for each individual sensor. Thus, the sensed 
SNR per location is spread across the field of view leading 
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Fig. 28. The BSS concept; the unobservable sources si[i], . . . , Sn[i] are 
mixed and corrupted by additive zero mean noise to generate the observations 
xi[i], . . . , Xm[i]. The target of BSS is to estimate an unmixing system to 
recover the original sources in yi [i] , . . . , i/n [i] ■ 

to a row-wise normalization of the observation matrix. On the 
other hand, in CS, the vector- valued observation corresponding 
to each signal component is normalized by each column. 
This difference has led to different regimes of compression 
rate [183]. In SN, in contrast to the CS setting, sensing 
capacity is generally small and correspondingly the number 
of sensors required does not scale linearly with the target 
sparsity. Specifically, the number of measurements is generally 
proportional to the signal dimension and is weakly dependent 
on target density sparsity. This issue has raised questions on 
compressive gains in power-limited SN applications based on 
sparsity of the underlying source domain. 

VI. Sparse Component Analysis: BSS and SDR 

A. Introduction 

Recovery of the original source signals from their mixtures, 
without having a priori information about the sources and 
the way they are mixed, is called Blind Source Separation 
(BSS). This process is impossible if no assumption about the 
sources can be made. Such an assumption on the sources may 
be uncorrelatedness, statistical independence, lack of mutual 
information, or disjointness in some space [19], [20], [43]. 

The signal mixtures are often decomposed into their con- 
stituent principal components, independent components or are 
separated based on their disjoint characteristics described in a 
suitable domain. In the latter case the original sources should 
be sparse in that domain. Independent Component Analysis 
(ICA) is often used for separation of the sources in the 
former case whereas Sparse Component Analysis (SCA) is 
employed for the latter case. These two mathematical tools are 
described in the following sections followed by some results 
and illustrations of their applications. 

B. Independent Component Analysis (ICA) 

The main assumption in ICA is the statistical independence 
of the constituent sources. Based on this assumption, ICA can 
play a crucial role in the separation and denoising of signals 
(BSS). 

There has been recent research interest in the field of 
BSS due to its practicality in a wide range of problems. 
For example, BSS of acoustic signals measured in a room is 
often referred to as the Cocktail Party problem, which means 
separation of individual sounds from a number of recordings 
in an echoic and noisy environment. Figure [28] illustrates 



the BSS concept, wherein the mixing block represents the 
multipath propagation model between the original sources and 
the microphone measurements. 

Generally, BSS algorithms make assumptions about the en- 
vironment in order to make the problem more tractable. There 
are typically three assumptions about the mixing medium. The 
most simple but widely used case is the instantaneous case, 
where the source signals arrive at the sensors at the same 
time. This has been considered for separation of biological 
signals such as the EEG where the signals have narrow 
bandwidths and the sampling frequency is normally low [184]. 
The generative model for BSS in this case can be easily 
formulated as: 

x[z] = H-s[i] +z/[z] (66) 

where s[i], x[i], and denote respectively the vector of 
source signals, size n x 1, observed signals size m x 1, and 
noise signals size m x 1. H is the mixing matrix of size 
m X n. Generally, the mixing process can be nonlinear (due 
to inhomogenity of the environment and that the medium 
can change with respect to the source signal variations; e.g. 
stronger vibration of a drum as a medium, with louder 
sound). However, in an instantaneous linear case where the 
above problems can be avoided or ignored, the separation 
is performed by means of a separating matrix, W of size 
n X m, which uses only the information contained in x.[i] 
to reconstruct the original source signals (or the independent 
components) as: 

y[z] = W • x[z] (67) 

where y[i] is the estimate for the source signal s[i]. The 
early approaches in instantaneous BSS started from the work 
by Herault and Jutten [185] in 1986. In their approach, 
they considered non-Gaussian sources with equal number of 
independent sources and mixtures. They proposed a solution 
based on a recurrent artificial neural network for separation of 
the sources. 

In the cases where the number of sources is known any 
ambiguity caused by false estimation of the number of sources 
can be avoided. If the number of sources is unknown, a 
criterion may be established to estimate the number of sources 
beforehand. In the context of model identification this is 
referred to as Model Order Selection and methods such as the 
Final Prediction Error (EPF), AIC, Residual Variance (RV), 
MDL and Hannan and Quinn (HNQ) methods [186] may be 
considered to solve this problem. 

In acoustic applications, however, there are usually time 
lags between the arrival times of the signals at the sensors. 
The signals also may arrive through multiple paths. This 
type of mixing model is called a convolutive model [187]. 
The convolutive mixing model can also be classified into 
two subcategories: anechoic and echoic. In both cases the 
vector representations of mixing and separating processes are 
modified as x[i] = H[i] * s[i] + u[i] and y[i] = W[i] * x[i], 
respectively, where * denotes the convolution operation. In an 
anechoic model, however, the expansion of the mixing process 
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may be given as: 

n 

Xr[i] = hrjSj[i — Srj] + [^] , for r = 1, . . . , m (68) 

where the attenuation, /i^^ j , and delay Srj of source j to sensor 
r would be determined by the physical position of the source 
relative to the sensors. Then the unmixing process to estimate 
the sources will be given as: 

m 

W = ^Wj^rXr[i - Sj^r], foi j = 1, . . . , n (69) 

r=l 

where the Wj^s are the elements of W. In an echoic mixing 
environment it is expected that the signals from the same 
sources reach the sensors through multiple paths. Therefore 
the expansion of the mixing and separating models will be 
changed to 

n L 

Xr[i] =^^^^hljSj[i - 5lj] ^ Urli], r = l,...,m (70) 

where L denotes the maximum number of paths for the 
sources, Urii] is the accumulated noise at sensor r, and {.y 
refers to the l^^ path. The unmixing process will be formulated 
similarly to the anechoic one. For a known number of sources 
an accurate result may be expected if the number of paths is 
known; otherwise, the overall number of observations in an 
echoic case is infinite. 

The aim of BSS using ICA is to estimate an unmixing 
matrix W such that Y = WX best approximates the 
independent sources S, where Y and X are respectively 
matrices with columns y[i] = [yi[i]^ y2[^]^ • • • , Uni^]] and 
x.[i] = X2[i], . . . , Thus the ICA separation 

algorithms are subject to permutation and scaling ambiguities 
in the output components, i.e. W = PDH~^, where P 
and D are the permutation and scaling (diagonal) matrices, 
respectively. Permutation of the outputs is troublesome in 
places where either the separated segments of the signals are 
to be joined together or when a frequency-domain BSS is 
performed. 

Mutual information is a measure of independence and max- 
imizing the non-Gaussianity of the source signals is equivalent 
to minimizing the mutual information between them [188]. 

In those cases where the number of sources is more than 
the number of mixtures (underdetermined systems), the above 
BSS schemes cannot be applied simply because the mixing 
matrix is not invertible, and generally the original sources 
cannot be extracted. However, when the signals are sparse, 
the methods based on disjointness of the sources in some 
domain may be utilized. Separation of the mixtures of sparse 
signals is potentially possible in the situation where, at each 
sample instant, the number of nonzero sources is not more 
than a fraction of the number of sensors (see Table [III row 
and column 6). The mixtures of sparse signals can also be 
instantaneous or convolutive. 



C. Sparse Component Analysis ( SCA ) 

While the independence assumption for the sources is 
widely exploited in the design of BSS algorithms, the pos- 
sible disjointness of the sources in some domain has not 
been considered. In SCA, this property is directly employed. 
Blind source separation by sparse decomposition has been 
addressed by Zibulevsky and Pearlmutter [189] for both over- 
determined/exactly-determined and underdetermined systems 
using the maximum a posteriori approach. One way of for- 
mulating SCA is by representing the sources using a proper 
signal dictionary: 

n 

^r\i] = ^Cr,l(t>i\i] (71) 
1=1 

where r = 1, . . . , m and n is the number of basis functions in 
the dictionary. The functions 0^ [i] are called atoms or elements 
of the dictionary. These atoms do not have to be linearly 
independent and may form an overcomplete dictionary. The 
sparsity property requires that only a small number of the 
coefficients Cr^i differ significantly from zero. Based on this 
definition, the mixing and unmixing systems are modeled as 
follows: 

x[i] = As[i] + 

s\i] = C^\i] (72) 

where is an m x 1 vector. A and C can be determined 
by optimization of a cost function based on an exponential 
distribution for Cij [189]. Figure [29] shows three separated 
sources using the above approach. 

In places where the sources are sparse and at each time 
instant, at most one of the sources has significant nonzero 
value, the columns of the mixing matrix may be calculated 
individually, which makes the solution to the underdetermined 
case possible. 

The SCA problem can be stated as a clustering problem 
since the lines in the scatter plot can be separated based on 
their directionalities by means of clustering. A number of 
works on this method have been reported [19], [191], [192]. 
In the work by Li et. ah [192], the separation has been 
performed in two different stages. First, the unknown mixing 
matrix is estimated using the k-means clustering method. 
Then, the source matrix is estimated using a standard linear 
programming algorithm. The line orientation of a data set may 
be thought of as the direction of its greatest variance. One 
way is to perform eigenvector decomposition on the covariance 
matrix of the data, the resultant principal eigenvector, i.e., the 
eigenvector with the largest eigenvalue, indicates the direction 
of the data, since it has the maximum variance. In [191], GAP 
statistics as a metric which measures the distance between 
the total variance and cluster variances, has been used to 
estimate the number of sources followed by a similar method 
to Li's algorithm explained above. In line with this approach, 
Bofill and Zibulevsky [16] developed a potential function 
method for estimating the mixing matrix followed by £i-norm 
decomposition for the source estimation. Local maxima of the 
potential function correspond to the estimated directions of 
the basis vectors. After the mixing matrix is identified, the 
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(a) 




Fig. 29. Separation of sparse signals using the algorithm given in [190], left 
(a) sources, (b) mixtures, and (c) reconstructed sources, in both time-frequency 
left, and time, right. 




(a) (b) 

Fig. 30. (a) the scatter plot and (b) the shortest path from the origin to the 
data point, x[i], extracted from [16]. 

sources have to be estimated. Even when A is known the 
solution is not unique. So, a solution is found for which the li- 
norm is minimized. Therefore, for x[i] = ^^jSj[i], \sj\ 
is minimized using linear programming. 

Geometrically, for a given feasible solution, each source 
component is a segment of length \sj\ in the direction of 
the corresponding and, by concatenation, their sum defines 
a path from the origin to x[i]. ]V[inimizing \sj\ amounts 
therefore to finding the shortest path to x[i] over all feasible 
solutions j = 1, . . . , n, where n is the dimension of space 
of the independent basis vectors [19]. Figure [30l shows the 
scatter plot, polar plot, and the shortest path from the origin 
to the data point x[i]. 

There are many cases for which the sources are disjoint in 



other domains, rather than the time-domain, or when they can 
be represented as sum of the members of a dictionary which 
can consist for example of wavelets or wavelet packets. In 
these cases the sparse component analysis can be performed 
in those domains more efficiently. Such methods often include 
transformation to time-frequency domain followed by a bi- 
nary masking [193] or a BSS followed by binary masking 
[187]. One such approach, called DUET [193], transforms 
the anechoic convolutive observations into the time-frequency 
domain using a short-time Fourier transform and the relative 
attenuation and delay values between the two observations 
are calculated from the ratio of corresponding time-frequency 
points. The regions of significant amplitudes (atoms) are 
then considered to be the source components in the time- 
frequency domain. In this method only two mixtures have 
been considered and as a major limit of this method, only 
one source has been considered active at each time instant. 

For instantaneous separation of sparse sources, the common 
approach used by most researchers is to attempt to maximize 
the sparsity of the extracted signals at the output of the 
separator. The columns of the mixing matrix A assign each 
observed data point to only one source based on some measure 
of proximity to those columns [194], i.e., at each instant only 
one source is considered active. Therefore the mixing system 
can be presented as: 

n 

^rH=^%-,r5j[i] , r=l,...,m (73) 

i=i 

where in an ideal case aj^r = for r ^ j. IMinimization of 
the ^i-norm is one of the most logical methods for estimation 
of sources as long as the signals can be considered sparse, 
^i-norm minimization is a piecewise linear operation that 
partially assigns the energy of x[z] to the m columns of A 
around x[z] in space. The remaining n — m columns are 
assigned zero coefficients, therefore the £i-norm minimization 
can be manifested as: 

min ||s[z] 11^^ subject to A • s[z] = x[z] (74) 

A detailed discussion of signal recovery using £i-norm mini- 
mization is presented by Takigawa et. al. [195] and described 
below. As mentioned above, it is important to choose a domain 
that sparsely represents the signals. 

On the other hand, in the method developed by Pederson 
et. al, as applied to stereo signals, the binary masks are 
estimated after BSS of the mixtures and then applied to the 
microphone signals. The same technique has been used for 
convolutive sparse mixtures after the signals are transformed 
to the frequency domain. 

In another approach [196] the effect of outlier noise has 
been reduced using median filtering then hybrid fast ICA filter- 
ing, and ^i-norm minimization have been used for separation 
of temporomandibular joint sounds. It has been shown that for 
such sources, this method outperforms both the Degenerate 
Unmixing Estimation Technique (DUET) and Li's algorithms. 
The authors of [197] have recently extended the DUET al- 
gorithm to separation of more than two sources in an echoic 
mixing scenario in the time-frequency domain. 
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TABLE VII 
SCA Steps 



1) Consider the model x = A • s, we need a linear trans- 
formation that applies to both sides of the equation to 
yield a new sparse source vector. 

2) Estimate the mixing matrix A. Several approaches are 
presented for this step, such as natural gradient ICA 
approaches, and clustering techniques with variants of 
k-means algorithm [199], [19]. 

3) Estimate the source representation based on the spar- 
sity assumption. A majority of proposed methods are 
primarily based on minimizing some norm or pseudo- 
norm of the source representation vector. The most 
effective approaches are Matching Pursuit [199], [200], 
Basis Pursuit, [189], [201], [78], [202]], FOCUSS 
[203], IDE [66] and Smoothed ^o-norm [204]. 



In a very recent approach it has been considered that brain 
signal sources in the space-time-frequency domain are disjoint. 
Therefore, clustering the observation points in the space-time- 
frequency-domain can be effectively used for separation of 
brain sources [198]. 

As can be seen, generally, BSS exploits independence of 
the source signals whereas SCA benefits from the disjointness 
property of the source signals in some domain. While the BSS 
algorithms mostly rely on ICA with statistical properties of the 
signals, SCA uses their geometrical and behavioral properties. 
Therefore, in SCA, either a clustering approach or a masking 
procedure can result in estimation of the mixing matrix. Often, 
an ^i-norm is used to recover the source signals. Generally, in 
places where the source signals are sparse, the SCA methods 
often result in more accurate estimation of the signals with 
less ambiguities in the estimation. 

D. SCA Algorithms 

There are three main steps for the solution of an SCA 
problem as shown in Table IVIll [1991. The first step of Table 
IVIII shows a linear model for the SCA problem, the second 
step consists of estimating the mixing matrix A using sparsity 
information, and finally the third step is to estimate the sparse 
source representation based on the estimate of A. 

In the following, we present a survey of major approaches 
that are suggested for the third step. 

1) Matching Pursuit: Mallat and Zhang have developed a 
general iterative method for approximating sparse decomposi- 
tion [200]. When the dictionary is orthogonal and the signal 
X is composed of /c <C n atoms, the algorithm recovers the 
sparse decomposition exactly after n steps. As we will see, 
the algorithm is greedy [205]. Since the algorithm is myopic, 
in some certain cases, wrong atoms are chosen in the first 
few iterations, and thus the remaining iterations are spent on 
correcting the first few mistakes. The algorithm is shown in 
Table [Villi 

2) Basis Pursuit: The mathematical representation of 
counting the number of sparse components is denoted by ^Q. 
However, ^Q is not a proper norm and is not computationally 
tractable. The closest convex norm to ^Q is ^i. The ii opti- 
mization of an over complete dictionary is called Basis Pursuit. 



TABLE VIII 
Matching Pursuit Algorithm 



1) 


Let x(0) = Onxi, r(0) = x and i = 1. 


2) 


Find index 7^ such that (r(*~-'^\ a-,,. } is maximum 




where corresponds to columns of the mixing matrix 




A (atoms). 


3) 


Let s^ = (r(^-i),a^J, x(^) = x(*-i) + s^a^^ and 




=x-xW. 


4) 


If the last condition is not satisfied, increase i and 




return to step 2. 



TABLE IX 

RELATION BETWEEN LP AND BASIS PURSUIT (THE NOTATION FOR 
LINEAR PROGRAMMING IS FROM [207].) 



Basis Pursuit 


Linear Programming 
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However the £i-norm is non-differentiable and we cannot use 
gradient methods for optimal solutions [206]. On the other 
hand, the ii solution is stable due to its convexity (the global 
optimum is the same as the local one) [21]. 

Formally, the Basis Pursuit can be formulated as: 

min||s||£^ s.t. x = A-s (75) 

We now explain how the Basis Pursuit is related to Linear 
Programming (LP). The standard form of linear programming 
is a constrained optimization problem defined in terms of 
variable x G by: 

min C^x s.t. Ax = b, Vi : > (76) 

where C^x is the objective function. Ax = b is a set of 
equality constraints and Vz : > is a set of bounds. 
Table [IXl shows this relationship. Thus, the solution of (TTSl) 
can be obtained by solving the equivalent LP. There are two 
major approaches to solve LP: 1) Interior Point methods and 
2) Simplex algorithms, depending on whether we solve the 
cost function and then check whether it satisfies the constraint 
bounds or vice versa. 

3) FOCal Underdetermined System Solver (FOCUSS): 
FOCUSS is a non-parametric algorithm that consists of two 
parts [203]. It starts by finding a low resolution estimation 
of the sparse signal, and then pruning this solution to a 
sparser signal representation through several iterations. The 
solution at each iteration step is found by taking the pseudo- 
inverse of a modified weighted matrix. The pseudo-inverse 
of the modified weighted matrix is defined by (AW)+ = 
(AW)^(AW • (AW)^)"^ This iterative algorithm is the 
solution of the following optimization problem: 

Find s = Wq, where: min ||q||£2 s.t. x = AWq (77) 

Description of this algorithm is given in Table |Xl and an 
extended version is discussed in [203]. 
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TABLE X 
FOCUSS (BASIC) 



• Step 1: Wp. = diag{si-i) 

• Step 2: cii = (AWp.)^x 

• Step 3: Si = Wp. • 



TABLE XI 
IDE STEPS 



• Detection Step: Find indices of inactive sources: 

m 

I = {l <i<m: |af • X -"^s^jSif • aj \ < e^} 

• Estimation Step: Find the following projection as the 
new estimate: 

I 

s^+^ = argming ^ sf s.t. x(f) = A • s{t) 

The solution is derived from Karush-Kuhn-Tucker sys- 
tem of equations. At the ^ + 1*^ iteration: 

Si = Af • P(x - Aa • Sa) 

= (A^PAa)-^A^P.x 

where the matrices and vectors are partitioned into 
inactive/active parts as Ai, Aa,Si,Sa and P = 

(A.Af)-^ 

• Stop after a fixed number of iterations. 



4) Iterative Detection and Estimation (IDE): The idea 
behind this method is based on a geometrical interpretation of 
the sparsity. Consider the elements of vector s are i.i.d. random 
variables. By plotting a sample distribution of vector s, which 
is obtained by plotting a large number of samples in the S- 
space, it is observed that the points tend to concentrate first 
around the origin, then along the coordinate axes, then across 
the coordinate planes. The algorithm used in IDE is given in 
Table |Xll In this table, s^'s are the inactive sources, Sa's are 
the active sources, is the column of A corresponding to 
the inactive and A^ is the column of A corresponding to 
the active s^. Notice that IDE has some resemblances to the 
RDE method discussed in Sec. IIII-A.2I IMAT mentioned in 
Sec. IIII-A.21 and MIMAT explained in Sec. IVII-A.21 

5) Smoothed £o-norm (SLO) Method: As discussed earlier, 
the criterion for sparsity is the ^o-norm; thus our minimization 
is 

min||s||£o s.t. A-s = x (78) 

The ^o-norm has two major drawbacks: the need for a com- 
binatorial search, and its sensitivity to noise. These problems 
arise from the fact that the ^o-norm is discontinuous. The idea 
of SLO is to approximate the ^o-norm with functions of the 
type [204]: 

f^{s)^e-f^ (79) 



TABLE XII 
SLO STEPS 



• Initialization: 

1) Set So equal to the minimum -^2 -norm solution 
of As = X, obtained by pseudo-inverse of A. 

2) Choose a suitable decreasing sequence for a, 
[cri, . . . ,crx]. 

• For i = 1, . . . , K: 

1) Set a = ai, 

2) Maximize the function Fa on the feasible set 
S = {s|As = x} using L iterations of the 
steepest ascent algorithm (followed by projection 
onto the feasible set): 

- Initialization: s = s^-i. 

- for j = 1, . . . , L (loop L times): 

a) Let: As = [sie~ 2^ Sne~ 2^]^. 

b) Set s ^ s — /iAs (where is a small 
positive constant). 

c) Project s back onto the feasible set *S: 

s ^ s - A^ (AA^) (As - x) 

3) Set Si = s. 

• Final answer is s = sx 



where a is a parameter which determines the quality of the 
approximation. Note that we have 

For the vector s, we have ||s||o ~ n — F(j{s), where 
— Xir=i fcri^i)- Now minimizing ||s||o is equivalent to 
maximizing Fcr{s) for some appropriate values of a. For small 
values of a, Fcr{s) is highly non-smooth and contains many 
local maxima, and therefore its maximization over A • s = x 
may not be global. On the other hand, for larger values of 
cr, F(j{s) is a smoother function and contains fewer local 
maxima, and its maximization may be possible (in fact there 
is no local maxima for large values of a [204]). Hence we use 
a decreasing sequence for a in the steepest ascent algorithm 
and may escape from getting trapped into local maxima and 
reach the actual maximum for small values of a, which gives 
the minimum ^o-norm solution. The algorithm is summarized 
in Table Kill 

6) Comparison of Different Techniques: The above tech- 
niques have been simulated and the results are depicted in 
Fig. [3TI In order to compare the efficiency and computational 
complexity of these methods; we use a fixed synthetic mix- 
ing matrix and source vectors. The elements of the mixing 
matrix are obtained from zero mean independent Gaussian 
random variables with variance a = 1. Sparse sources have 
been artificially generated using a Bernoulli-Gaussian model: 
Si = p N{0,aon) + (1 -p) ^"(0,^0//). We set do// = 
0.01, a on = 1 and p = 0.1. Then, we compute the noisy 
mixture vector x from x = As + where v is the noise 
vector. The elements of the vector v are generated according 
to independent zero mean Gaussian random variables with 
variance a^. We use Orthogonal Matching Pursuit (OMP) 
which is a variant of Matching Pursuit [200] . OMP has a better 
performance in estimating the source vector in comparison to 
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Fig. 31. Performance of various methods with respect to the standard 
deviation. 
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Fig. 33. Linear least squares: e = (P{h,V) + (P{f2,V) + ^^(/g^ y). 
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Fig. 34. Objective function: (a) e = (P{fi,V2) + d?{f2.Vi) + d?{h,Vi) 
and (b) e = (P{h,V2) + (P (/2 , V2 ) + (/g , ) . Configuration of Vi , ^2 
in (a) creates the partition Pi = {/i} and P2 = {/2,/3} while the 
configuration in (b) causes the partition Pi = {/i, /2} and P2 = {/s}- 



Matching Pursuit. Fig. [32l demonstrates the time needed for 
each algorithm to estimate the vector s with respect to the 
number of sources. This figure shows that IDE and SLO have 
the lowest complexity. 

E. Sparse Dictionary Representation (SDR) and Signal Mod- 
eling 

A signal x G may be sparse in a given basis but not 
sparse in a different basis. For example, an image may be 
sparse in a wavelet basis (i.e., most of the wavelet coefficients 
are small) even though the image itself may not be sparse (i.e., 
many of the gray values of the image are relatively large). 
Thus, given a class S C M"^, an important problem is to find a 
basis or a frame in which all signals in S can be represented 
sparsely. More specifically, given a class of signals 5 C M"^, 
it is important to find a basis (or a frame) D = {wj}^^^ 
(if it exists) for such that every data vector x G 5 can be 
represented by at most /c <C n linear combinations of elements 
of D. The dictionary design problem has been addressed in 
[19], [20], [21], [81], [84], [95], [208]. A related problem is 
the signal modeling problem in which the class S is to be 
modeled by a union of subspaces M. = Ui=i ^« where each Vi 
is a subspace of with a dimension ofVi<k where k <^ n 
[43]. If the subspaces Vi are known, then it is possible to pick 
a basis E'^ = {e)}j for each Vi and construct a dictionary 
D = Ui=i which every signal of S has sparsity k (or is 
almost k sparse). The model Ai = Ui=i t)e found from 

an observed set of data F = {/i, . . . , /^} C 5 by solving (if 
possible) the following non-linear least squares problem: 



Find subspaces , . . . , of that minimize the expres- 
sion 

m 

e{F,{V^,...,Vl})=Y,,^^d\f^,VJ) (81) 

over all possible choices of / subspaces with dimF^ < k < 
N. Here d denotes the Euclidian distance in R^ and k is 
an integer with 1 < k < n for i = 1,...,/. Note that 
e(F, {Fi, . . . , V/}) is calculated as follows: for each fi ^ F 
and fixed {Fi, . . . , VJ}, the subspace Vi G {Fi, . . . , VJ} closest 
to fi is found and the distance d'^{fi^Vj) is computed. This 
process is repeated for all fi e F and the squares of the 
distances are added together to find e(F, {Fi, . . . , FJ). The 
optimal model is then obtained as the union M = [j^ Vf, 
where {V^^, • • • , minimize the expression ([811 . When / = 
1 this problem reduces to the classical least squares problem. 
However, when / > 1 the set [j^Vi is a nonlinear set and 
the problem is fully non-linear (see Figs. [34l and [35]). A more 
general nonlinear least squares problem has been studied for 
finite and infinite Hilbert spaces [43]. In that general setting, 
the existence of solutions is proved and a meta- algorithm for 
searching for the solution is described. 

For the special finite dimensional case of R^ in ([8T1) . 
the search algorithm is an iterative algorithm that alternates 
between data partition and the optimization of a simpler least 
squares problem. The search algorithm can be summarized as 
in Table [Xlllj 

The above algorithm, which is similar to k-means, consists 
of two parts [209]: 1) an initialization; and 2) an iterative 
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TABLE XIII 
Search Algorithm 




Fig. 35. Data F belongs to two planes in M^. The algorithm uses a random 
initial partition in left hand side, and produces the final partition and optimal 
subspaces in the right hand side. 

search algorithm. The initiaUzation is a Hough-Hke transform 
that partitions the data set F into / classes {F^, . . . , F^^}, 
such that each class Ff consists of vectors that are close to 
the same (yet unknown) subspace . For each partition F/ 
we find the subspace closest to the vectors in F/ among 
all subspaces F C of dimension dimV < k. Finding the 
subspace closets to the vectors in F/ for a fixed i is the 
classical linear least squares problem and can be found exactly 
using a Singular Value Decomposition. This initialization 
process produces a first approximation { , • • • , } of the 
optimal spaces {Vi ^ . . . ^Vf^ minimizing (ISTI) . We now use 
the subspaces {F^^, . . . , ]^^} to find a new partition of the 
data Ff , . . . , F'^ in such a way that the vectors in Ff consist 
of all the vectors in F that are closest to F/, i = 1, . . . , /. 
Using this new partition F^^, . . . , F^ of F, we find (for each 
i) the subspace closest to the vectors in F^ using SVD 
as before. We continue this process until the value of e is 
unchanged between two consecutive iterations. This is a local 
minimum which is also likely to be a global one (see Fig. [35]) . 

VII. MuLTiPATH Channel Estimation 

In wireless systems, channel estimation is required for the 
compensation of channel distortions. The transmitted signal 



bounces off different objects and arrives at the receiver from 
multiple paths. This phenomenon causes the received signal 
to be a mixture of reflected and scattered versions of the 
transmitted signal. The mobility of the transmitter, receiver, 
and scattering objects results in rapid changes in the channel 
response, and thus the channel estimation process becomes 
more complicated. Due to the sparse distribution of scattering 
objects, a multipath channel is sparse in the time domain as 
shown in Fig. [36l By taking sparsity into consideration, chan- 
nel estimation can be simplified and/or made more accurate. 
The sparse time varying multipath channel is modeled as: 

k-l 

h{t,T)=Y,Mmt-r) (82) 

where k is the number of taps, ai is the l^^ complex path gain, 
and Ti is the corresponding path delay. At time t, the transfer 
function is given by: 

/ + 00 
h{t,r)e-^^''f^dT (83) 
-oo 

The estimation of the multipath channel impulse response is 
very much similar to the determination of analog epochs and 
amplitudes of discontinuity for finite rate of innovation as 
shown in section llI-B.4I Fig.[8land ([T9l) . Essentially, if a known 
train of impulses is transmitted and the received signal from 
the multipath channel is filtered and sampled (information 
domain as shown in Fig. [8] -rate of innovation), the channel 
impulse response can be estimated from these samples using 
an annihilating filter (the Prony method [28]) defined in 
the ^-transform and a pseudo-inverse matrix inversion, in 
principleEl. Once the channel impulse response is estimated, its 
effect is compensated; this process can be repeated according 
to the dynamics of the time varying channel. 

A special case of multipath channel is an OFDM channel, 
which is widely used in ADSL, DAB, DVB, WLAN, WMAN, 
and WIMA^O. OFDM is a digital multi-carrier transmission 
technique where a single data stream is distributed over several 
sub-carrier frequencies to achieve robustness against multipath 
channels besides many other advantages. Channel estimation 
in OFDM can be easier than for other modulation schemes; 
the channel impulse response is now quantizecQ and instead 
of an annihilating filter defined in the ^-transform, we can 
use DFT and ELP of section IIII-AI Also, instead of a known 
train of impulses, some of the available sub-carriers of OFDM 
in each transmitting block are assigned to predetermined 
patterns, which are usually called comb-type pilots. These pilot 
tones help the receiver to extract some of the DFT samples 
of the discrete time varying channel (1821) at the respective 
frequencies in each transmitting block. These characteristics 
make the OFDM channel estimation similar to unknown sparse 
signal recovery of section III-A.ll and the impulsive noise 
removal of section IIII-A.2I Because of these advantages, our 

Similar to Pisarenko method for spectral estimation [28]. 

These acronyms are defined in Table IXVl at the end of the paper. 
^^For simplicity, we assume a time-quantized channel impulse response 
which works quite well. For an accurate channel estimation of OFDM channel 
with AWGN, we can use fractional time delays. 



• Input: 

- initial partition {F^,.. .^F^} 

- Data set J=' 

• Iterations : 

1) Use the SVD to find {y^^ , ... , } by minimiz- 
ing ei^F^^VA for each i, and compute Fi = 
Eie(F/,y/); 

2) Set J = 1; 

3) While r,- = E«e(F/,V;^) > 
e(^, {!//,..., y/}) 

4) Choose a new partition {f('^^ , . . . , Fj''^^} that 
satisfies, / e implies that d{f,V^) < 
d{f,vi),h = l,...,l- 

5) Use SVD to find and choose 
{vi^^ , . . . ,V^^^}, by minimizing 
e(^F^^^ ^Vi) for each i, and compute 
r,+i=Eie(F/+\;^^+i); 

6) Increment j by 1, i.e., j ^ j + 1; 

7) End while 

• Output: 

- Pj and Vp. . 
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Fig. 36. The impulse response of two typical multipath channels; (a) Brazil-D 
and (b) TU6 channel profiles. 



main example and simulations are related to OFDM channel 
estimation. 



A. OFDM Channel Estimation 

For OFDM, the discrete version of the time varying channel 
of ([83]) in the frequency domain becomes 



H{rTf,zAf) 



k-l 

E 



h[r, l]t 



where 



h[r,l]=h{rTf,lTs) 



(84) 



(85) 



where Tf and n are the symbol length and number of sub- 
carriers in each OFDM symbol, respectively. A/ is the sub- 
carrier spacing, and Tg = ^ is the sample interval. The above 
equation shows that for the r^^ OFDM symbol, H[r, i] is the 
DFT of h[r, I]. 

Two major methods are used in the equalization process: 
1) zero forcing and 2) MMSE. In the zero forcing method, 
regardless of the noise variance, equalization is obtained 
by dividing the received OFDM symbol by the estimated 
channel frequency response; while in the MMSE method, the 
approximation is chosen such that the MSE of the transmitted 
data vector [||X — X|p] ) is minimized, which introduces 
the noise variance in the equations. 

1) Statement of the Problem: The goal of the channel 
estimation process is to obtain the channel impulse response 
from the noisy values of the channel transfer function in the 
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Fig. 37. Graphical representation of the sparse problem involved in OFDM 
channel estimation 



pilot positions. This is equivalent to solving the following 
equation for h which is also shown graphically in Fig. [37l 



F, h 



(86) 



where ip is an index vector denoting the pilot positions in 
the frequency spectrum, H^^ is a vector containing the value 
of the channel frequency spectrum in these pilot positions and 
denotes the matrix obtained from taking the rows of the 
DFT matrix pertaining to the pilot positions. is the additive 
noise on the pilot points in the frequency domain. Thus, the 
channel estimation problem is equivalent to finding the sparse 
vector h from the above set of equations for a set of pilots. 
Various channel estimation methods [210] have been used with 
the usual tradeoffs of optimality and complexity. The Least 
Square [210], ML [211], [212], Minimum Mean Square Error 
(MMSE) [213], [214], [215], and LMMSE [213], [211], [216] 
techniques are among some of these methods. But none of 
these techniques use the inherent sparsity of the multipath 
channel h, and thus, they are not as accurate. 

2) Sparse OFDM Channel Estimation: In the following, 
we present two methods that utilize this sparsity to enhance 
the channel estimation process. 

CS Based Channel Estimation: Recently the idea of using 
time-domain sparsity in OFDM channel estimation has been 
proposed by [217], [218], [219]. The use of sparsity decreases 
the channel estimation error and hence the number of required 
pilots (overhead), thus increasing the bandwidth efficiency. In 
[217], the authors proposed to use CS for OFDM channel 
estimation and proved that, in case of uniform pilot insertion, 
the OFDM channel estimation problem satisfies the Restricted 
Isometric Property (RIP) described in section III-B[ and thus 
LP-based algorithms similar to the ones discussed in section 
IVI-D.2I can be used for channel estimation. Simulation results 
show that this method works effectively even in fast time 
varying channels. Furthermore, from (|7]), a lower bound on 
the number of pilots can be obtained for a given number of 
channel taps (sparsity). 

However, the authors of [217] did not consider zero padding 
at the endpoints of the OFDM bandwidth in their scenario 
which is an essential part of the current standards based 
on OFDM transmission. This assumption causes the matrix 
Fi defined in ([86b to be ill-conditioned and thus, the RIP 
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Fig. 38. Graphical representation of the equation involved in obtaining the 
tap values above the threshold 

condition defined in (O is not satisfied. Also we do not have 
any pilots in the zero padded parts which complicates the 
channel estimation techniques. In [24], we propose a method 
that exploits the inherent sparsity and also solves the zero 
padding problem. This algorithm is briefly discussed in the 
following. 

Modified IMAT (MIMAT) for OFDM Channel Estimation: 
In this method, the spectrum of the channel is initially esti- 
mated using a simple interpolation such as linear interpolation 
between pilot sub-carriers. This initial estimate is further 
improved in a series of iterations between time (sparse) and 
frequency (information) domains to find the sparsest channel 
impulse response by using an adaptive thresholding scheme; in 
each iteration, after finding the location of the taps (locations 
with previously estimated amplitudes higher than the thresh- 
old), their respective amplitudes are again found using the 
MMSE criterion. In each iteration, due to thresholding, some 
of the false taps that are noise samples with amplitudes above 
the threshold are discarded. Thus, the new iteration starts with 
a lower number of false taps. Moreover, because of the MMSE 
estimator, the valid taps approach their actual values in each 
new iteration. In the last iteration, the actual taps are detected 
and the MMSE estimator gives their respective values. This 
method is similar to RDE and IDE methods discussed in 
section IIII-A.2I and IVI-D.4[ 

Table IXIVI summarizes the steps in the MIMAT algorithm. 
In the threshold of the MIMAT algorithm, a and (3 are 
constants which depend on the number of taps and initial 
powers of noise and channel impulses. In the first iteration, 
the threshold is a small number and with each iteration it 
is gradually increased. Intuitively, this gradual increase of 
the threshold with the iteration number, results in a gradual 
reduction of false taps (taps that are created due to noise). In 
each iteration, the tap values are obtained from: 

Hl5v = Hi, + I/., = F • h, (87) 

where t denotes the index of nonzero impulses obtained from 
the previous step and F is obtained from F^^ by keeping 
the columns determined by t. A graphical representation of 
the above equation is given in Fig. [38l The amplitudes of 



TABLE XIV 

MIMAT Algorithm for OFDM channel estimation 



• Initialization: 




Find an initial estimate of the time domain channel 




using linear interpolation: h^^^ = hunear 


• Iterations: 


1) 


Set Threshold=;5e°^\ 


2) 


Using the threshold from the previous step, 




find the locations of the taps t by threshold- 




ing the time domain channel from the previous 




iterationC/i^'^-i)). 


3) 


Solve for the value of the non-zero impulses 




using MMSE: 




fit = SNR- F^{F ■ SNR-F^ (88) 


4) 


Find the new estimate of the channel (/i^*)) by 




substituting the taps in their detected positions. 


5) 


Stop if the estimated channel is the same as the 




previous iteration or when a maximum number 




of iterations is reached. 




CNR (dB) 

Fig. 39. SER (Symbol Error Rate) vs. CNR (Carrier to Noise Ratio) for the 
ideal channel, linear interpolation, and the MIMAT for the Brazil channel. 

nonzero impulses can be obtained from simple iterations, 
pseudo-inverse, or the MMSE equation ([88]) of Table IXIVI that 
yields better results under additive noise environments. 

The equation that has to be solved in ([87l) is usually over- 
determined which helps the suppression of the noise in each 
iteration step. Note that the solution presented in ([88l) , rep- 
resents a variant of the MMSE solution when the location of 
discrete impulses are known. If further statistical knowledge is 
available, this solution can be modified and a better estimation 
is obtained; however, this makes the approximation process 
more complex. This algorithm does not need many steps of 
iterations; the positions of the non-zero impulses are perfectly 
detected in 3 or 4 iterations for most types of channels. 

B. Simulation Results and Discussions 

For OFDM simulations, the DVB-H standard was used 
with the 16-QAM constellation in the 2K mode (2^^ EFT 
size). The channel profile was the Brazil channel D. Fig. 
[39l shows the Symbol Error Rate (SER) versus the Carrier- 
to-Noise Ratio (CNR) after equalizing using the proposed 
MIMAT algorithm and the standard linear interpolation in the 
frequency domain using the noisy pilot samples. As can be 
seen in Fig. [39l the SER obtained from the MIMAT algorithm 



34 




Fig. 40. SER (Symbol Error Rate) vs. CNR (Carrier to Noise Ratio) of 
MIMAT method for the Brazil channel with various Doppler frequencies. 



analysis can be applied to other applications with possible 
reduction of sampling rate. As such, this tutorial has provided 
the route for new applications of sparse signal processing to 
emerge, which can potentially reduce computational complex- 
ity and improve performance quality. 

Appendix I 

Acceleration Methods: Chebyshev and Conjugate 
Gradient (CG) [65] 

A. Chebyshev Algorithm 
m Initialization: 



coincides with the one obtained from the hypothetical ideal 
channel (where the exact channel frequency response is used 
for equalization). Thus, in this sense the proposed channel 
estimation is perfect in time invariant channels. Also, this 
figure shows that the standard linear interpolation method 
performs poorly compared to MIMAT. 

This estimation technique is highly robust in rapidly chang- 
ing channels and shows only minor performance degradation 
as the Doppler frequency increases as shown in Fig. |40l 

VIII. Conclusion 

A unified view of sparse signal processing has been pre- 
sented in tutorial form. The sparsity in the key areas of 
sampling, coding, spectral estimation, array processing, com- 
ponent analysis, and channel estimation has been carefully 
exploited. Some form of uniform or random sampling has been 
shown to underpin the associated sparse processing methods 
used in each of these fields. The reconstruction methods used 
in each application domain have been introduced and the 
interconnections among them have been highlighted. 

This development has revealed; for example, that the it- 
erative methods developed for random sampling can be ap- 
plied to real-field block and convolutional channel coding 
for impulsive noise (salt-and-pepper noise in the case of 
images) removal, sparse component analysis, and channel 
estimation for orthogonal frequency division multiplexing sys- 
tems. These iterative reconstruction methods have been shown 
to be naturally extendable to spectral estimation and sparse 
array processing due to their similarity to channel coding 
in terms of mathematical models. Conversely, the minimum 
description length method developed for spectral estimation 
and array processing has potential for application in other 
areas. The error locator polynomial method developed for 
channel coding has, moreover, been shown to be a discrete 
version of the annihilating filter used in sampling with a finite 
rate of innovation and the Prony method in spectral estimation; 
the Pisarenko and MUSIC methods are further improvements 
of the Prony method. 

Linkages with emergent areas such as compressive sensing 
and sensor networks have also been considered. In addition, 
it has been suggested that the linear programming methods 
developed for compressive sensing and sparse component 
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where S and V are the sampling and filtering operators, 
respectively. Also, A and B are frame bound parameters and 
N is the number of iterations. 



B. Conjugate Gradient Algorithm 
• Initialization: 
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Pn+i\i] = r„+i\i] - X„p„\i] 



(92) 



where S and V are the sampUng and filtering operators, 
respectively. N is the number of iterations and (a;[i],y[i]) 
denotes the inner product of the two functions x[i] and y[i]. 
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Appendix II 
ELP Decoding for Erasure Channels [53] 

For lost samples, the polynomial locator for the erasure 
samples is 

k k 



H{zi)= n {z.-e'-^)=Y,htZ 



k-t 



(93) 



t=0 



HiziJ = 0, 



m 



1,2, 



(94) 



where Zj 



' . The polynomial coefficients ht^ t 



0, . . . , /c can be found from the product in (l93l) ; it is easier to 
find ht by obtaining the inverse FFT of H{z). By multiplying 
([94l) by e[im] • (^i^) (where r is an integer) and summing 
over m, we get 



k 



E 



.k-\-r-t 



= 



(95) 



t=0 m=l 

Since the inner summation is the DFT of the missing samples 

e[im], we get 



k 

E 

t=o 



ht ■ E[k - 



t]=0 



(96) 



where E[.] is the DFT of e[i]. The received samples, d[i], 
can be thought of as the original over-sampled signal, x[i], 
minus the missing samples e[im]- The error signal, e[i], is the 
difference between the corrupted and the original over- sampled 
signal and hence is equal to the values of the missing samples 
for i = im and is equal to zero otherwise. In the frequency 
domain, we have 

E[j]=X[j]-D[ji j = h...,n (97) 

Since X[j] = for j G 9 (see the footnote on page [TOb, 
then 

m = -D[j]. (98) 
The remaining values of E[j] can be found from (l96b , by 
the following recursion: 



E[r] = —^hk-tE[r^t] 



(99) 



where r ^ 6 and the index additions are in mod{n). 



Appendix III 

ELP Decoding for Impulsive Noise Channels [32], 

[96] 

For all integer values of r such that r G 6 and r + /c G 9, 
we obtain a system of k equations with k -\- 1 unknowns (ht 
coefficients). These equations yield a unique solution for the 
polynomial with the additional condition that the first nonzero 
ht is equal to one. After finding the coefficients, we need to 
determine the roots of the polynomial (|93]) . Since the roots 
of H{z) are of the form e^~^ , the inverse DFT (IDFT) of 
the {hm}m=o t)e used. Before performing IDFT, we have 
to pad n — 1 — k zeros at the end of the {hm}m=o sequence 
to obtain an n-point signal. We refer to the new signal (after 
IDFT) as {Hi}^~Q. Each zero in {Hi} represents an error in 
r[i] at the same location. 



Appendix IV 
Proofs for MDL formulas 

Proof of (1581) : The eigenvectors of R^^^ are the same as 
the ones for R and its eigenvalues are the reciprocals of the 
eigenvalues of Rm l and we know the eigenvectors constitute 
an orthonormal set. Thus we have: 

k n 

tr(R^i^R) = tr((^A-ivivf +<T^i ^ v^vf) 

i=l i=fe+l 



•(E^^^^^f)) 



= tr(^v,vf + ^ ^v.vf) (100) 

where tr{.) represents the trace operator on matrices. We know 
that if both AB and BA are defined, tr(AB) = tr(BA). 
Thus: 



tr(v^vf) =tr(vfv^) 



(101) 



and since v/^ v, = 1 for z = 1, . . . , n, we have: 



A. 



tr(R-R) = E^KvfvO+ E -t^Kvfv.) 

(102) 



i=/c+l ^ML 



We also have af^^ = X^IL/c+i which results in: 

tr(R^^K) = k^n-k = n (103) 
Proof of ([60b : The term ^ logm is the penalty function and 



we have: 



log/(x;RML) 



log 



KRmlI^ 

mlog(7r) +mlog(|RML|) 
+tr(R^i^R) 
= mlog|RML| 

+ n + mlog(7r) (104) 

V ' 

C{m) 

The first term of the above equation can be written as: 

k 

mloglRMLl = mlog((nAi)(((T2,^r-'=)) 



= m 



-\-m{n — k) lor ^ ^^'^ 



(105) 



therefore, 

n 

-log/(x;RML) = m^\og(\i) 

-\-m{n — k) \o{ 
+C(m) 



\n-k ^ 



(106) 
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where C{m) only depends on the parameter m and not k. 
Thus we can ignore this term in the MDL criterion. 

Appendix V 
Sparse Matrix Manipulations [34], [35], [36] 

Many physical phenomena and engineering problems are 
represented by a mathematical model very often in the form 
of ordinary, or partial differential equations. The explicit solu- 
tions of these equations are rarely available, unless for special 
cases. It is customary to try to find approximate solutions 
to these problems by making them discrete and linear. The 
resulting linear system involves very large matrices, which 
hopefully contain many zeros. A matrix, with very few non- 
zero elements is called a sparse matrix. 

Solving large linear systems are difficult as it is very time 
consuming, costly and laborious, unless the matrices involved 
are sparse and have many zero elements. The sparse property 
of the matrix may reduce the storage and computing time 
considerably, crucial to solving the linear systems that repre- 
sent such matrices. This is achieved by storing and computing 
only non-zero entries of the matrix. Various techniques can be 
applied to solve such a linear system [220]. 

There are two main approaches in finding admissible solu- 
tions for sparse linear systems, 1) Direct Methods [35], and 
2) Iterative Methods [34], [221], [222]: 

1) Direct Methods give the exact solutions in a finite 
number of elementary operations, provided that there are 
no rounding errors. Direct methods for a sparse linear 
system fall into three categories: (i) Gaussian elimination 
techniques, (ii) Triangular factorization, in particular us- 
ing decomposition techniques, such as, LU, Incomplete 
Lower and Upper (ILU), and Cholesky factorization, (iii) 
Householder reduction to upper triangular form. Direct 
methods, in real applications, have an advantage of 
providing solutions with robust and predictable behavior. 
Detailed analysis and discussion of these methods have 
been well established, and can be found in [35], [36], 
and references therein. 

2) Iterative Methods consist of Jacobi, Gauss-Siedel, and 
Successive Over-Relaxation (SOR). The most suitable 
method for the sparse linear system is the projection 
method, and in particular the Krylov subspace method. 
With the new development in iterative methods for ap- 
proximate solution of linear systems, it has been realized 
that new iterative techniques of projection methods, and 
in particular the combination of preconditioning and 
projection onto Krylov subspace iterations, can be a 
simple and efficient solver of sparse linear systems, and 
can compete with the direct methods in its applications 
[34], [221]. 

Each of the above two methods is suitable for different 
classes of matrices. Direct methods are best applied to those 
classes that produce only a few fill-ins - a fill-in being a 
new non-zero entry created at a position where the original 
matrix contains a zero entry after a linear algebraic operation is 
performed. Preconditioned iterative methods are the preferred 
choice for the class of matrices that produce many fill-ins in 
the process. 



There are two types of sparse matrices: matrices with 
regularly patterned non-zeros (group I), and matrices with 
irregularly structured matrices (group II). Such a distinction 
between matrices is particularly important in the iterative 
solution methods since algebraic operations for the group 
I of matrices can be significantly reduced using a com- 
puter. There is a large number of different sparse matri- 
ces archived and accessible on the website managed by T. 
Davis, The University of Florida Sparse Matrix Collection, 
http://www.cise.ufl.edu/research/sparse/matrices 



A. Solution of Linear Systems 
Consider the linear system 

Ax = b (107) 

where A is an m x n matrix, x is the unknown, and b is an 
m X 1 vector. For the exact solution of this system, there are 
three cases: 

1) The square matrix A (m = n) is invertible (non- 
singular): there is a unique solution x = A~^b. 

2) The matrix A is singular and b G 7^(A) (b belongs 
to the Range of A): there are infinitely many solutions 
X = xo + V, for all v G Ker{A) (Null space of A), 
where xq is a particular solution of Ax = b. 

3) The matrix A is singular and b ^ 1Z{A)\ there is no 
solution. 

However, for large n, calculating the inverse of A is a 
complex task. Further, if approximate solutions are found, it 
may be difficult to estimate how accurate they are. This will 
depend on the entries of matrix A. Further, for the case 2 
above, where n < m and rank{A) = n, denote the pseudo 
inverse by A+ = (A^A)~^A^, the problem is converted to 
finding the vector x = A+b (an approximate solution) subject 
to the following conditions: 

• X satisfies the least square solution. Namely, vector x 
minimizes the norm of ||r||, where the residual vector 
r = b — Ax. 

• X is the solution of the system Ax = b — r, when A^r = 
0. 

Note that in the case that A is the product of certain special 
matrices (diagonal, orthogonal, triangular, and other invertible 
matrices), the solution can be found by various direct methods 
[34], [35], [36]. 

B. Iterative Methods 

Consider an n x n real coefficient matrix A and a real n- 
vector b in linear system (I1Q7I) , the decomposition of A as: 

A = L + D + U (108) 

where D is the diagonal matrix of A with all non-zero entries, 
and L and U are the strict lower and upper matrices of A, 
respectively. The iterative solution vector x^+i is given by 

x/e+i = -(D + L)"^ [Ux/e - b] (Gauss-Seidel) 
x/e+i = -D-^ [(L + U)x/e - b] (Jacobi) (109) 
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or in a general form, 

Xfc+i = Gxfc+f (110) 

where in Jacobi iterations G = Gj(A) = I — D~^A, f = 
D~^b, and in Guass-Seidel iteration G = Ggs'(A) = I — 
(D + L)~^ A, f = (D + L)~^b. This iteration can be viewed 
as a technique for solving the linear system: 

(I-G)x = f, 
or M"^Ax = M"^b (111) 

where the precondition matrix M is given by Mj = D, and 
M-Gs = D + L. The iterations given above converge and the 
limit is a solution of the original linear system (for details see 
[36], [221], [222]). 

1 ) Krylov Subspace Methods: Let xq be an initial approx- 
imation to the solution of (I1Q7I) , tq = b — Axq be the 
initial residual, and let IK^(A,ro) be the Krylov subspace 
of dimension m defined by 

Km{A, To) = Span{ro, Aro, . . . , A^-^ro} (112) 

where these subspaces are nested, i.e., C K^+i, (m = 
1,2,3,...). Krylov subspace methods are iterative methods in 
which x^, an approximation to the solution of (I1Q7I) , at the 
m^^ step, is found in xq + Km, namely the approximation is 
of the form 

X = xo + gm-i(A)ro (113) 

where, Qm-i is a polynomial of degree at most m — 1. If 
the system is real, then coefficients of Qm-i are real. This 
natural expression implies that the residual = b — Ax^ is 
associated with the so-called residual polynomial Pm of degree 
at most m with Pm(0) = 1 because 

r^ = b-Ax^ ro-Ap^(A)ro 

= Pm{A)ro (114) 

The error satisfies e = x^ — = p^(A)(xo — x*), 
where x* is the exact solution of (I1Q7I) . Let us denote by 
Vm the set of all polynomials p of degree at most m such 
that p{0) = 1. The approximation x^ G xq + Km is often 
found by requiring x^ to be the minimizer of some functional. 
There are different methods depending on the characteristics 
of the matrix and implementation. Thus, each method defines 
implicitly a different polynomial Pm ^ Vm (for details see 
[223]). 

The extra condition imposed for convergence and complete- 
ness are 

• Trm is orthogonal to Km (Galerkin condition: r^±IK^), 

• Minimum residual condition: 

rm= min ||b - Ax|| (115) 

xGxo+K 

m 

We note that the nested property of the Krylov subspaces 
imply that any method for which one of the conditions (I115t 
holds will terminate in at most n steps. The desired methods 
are those which produce a good approximation to the solution 
of (1 1071) in many fewer than n iterations. An important 
ingredient that makes Krylov subspace methods work is the 



use of preconditioners, a matrix or operator M. used to convert 
equation (11071) into (lllll) . There is no one method which is 
recommended for all problems. 

Some of the applications on Sparse matrices besides the one 
discussed in the main body of the paper are: 

C. Applications in Photonics and Electromagnetics 

Numerical simulation for the development of photonics and 
electromagnetic CAD software packages has been the subject 
of intensive research in the past decade. The most widely 
used simulation method is the Finite-Difference Time-Domain 
(FDTD) for time-domain analysis of Maxwell's equations. The 
more recently introduced schemes would require the solution 
of large sparse linear systems at each unit of time [224]. 
A popular method for time-domain problems is the time- 
domain beam propagation method [225], which necessitates 
a multiplication of the input field vector with a very sparse 
matrix. This sparse multiplication is advantageous in that the 
method can become very efficient and highly parallel [226]. 
Another method for the time-domain simulation of Maxwell's 
equations is the Finite-Element Time-Domain (FETD) tech- 
nique [224]. The FETD leads to an almost sparse linear 
system; the computational complexity of this method can be 
considerably reduced by inverting the sparse matrix [227]. 

D. Applications in Genomic Signal Processing [228] 

Micro-arrays (DNA and protein) are parallel biosensors 
capable of detecting a large number of different genomic 
particles simultaneously. DNA micro-arrays that use tens of 
thousands of probe spots detect multiple targets in a single 
experiment. This is a wasteful use of the sensing resources in 
comparative DNA micro-array experiments. Generally, only a 
fraction of the total number of genes with respect to a reference 
sample is differentially expressed, and, thus, a vast number 
of probe spots may not provide any useful information. An 
alternative design is the so-called compressed micro-arrays; 
this translates to significantly lower costs, simpler image 
acquisition and processing, and smaller amount of genomic 
material needed for experiments. To recover signals from 
compressed micro-array measurements, ideas from CS (Sec. 
in-Bb can be employed. For sparse measurement matrices, 
sparse algorithms can be used to lower the computational 
complexity than the widely used linear-programming-based 
methods [92], [228]. 
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TABLE XV 
List of Acronyms 



ADSL: 


Asynchronous Digital 


AIC: 


Akaike Information Criterion 




Subscriber Line 


AR: 


Auto-Regressive 


ARMA: 


Auto-Regressive Moving Average 


BSS: 


Blind Source Separation 


BW: 


Bandwidth 


CAD: 


Computer Aided Design 


CFAR: 


Constant False Alarm Rate 


CG: 


Conjugate Gradient 


CS: 


Compressed Sensing 


CT: 


Computer Tomography 


DAB: 


Digital Audio Broadcasting 


DC: 


Direct Current: Zero-Frequency 


DCT: 


Discrete Cosine Transform 




Coefficient 


DFT: 


Discrete Fourier Transform 


DHT: 


Discrete Hartley Transform 


DOA: 


Direction Of Arrival 


DST: 


Discrete Sine Transform 


DT: 


Discrete Transform 


DVB: 


Digital Video Broadcasting 


DWT: 


Discrete Wavelet Transform 


EEG: 


ElectroEncephaloGraphy 


ELP: 


Error Locator Polynomial 


ESPRIT: 


Estimation of Signal Parameters via 


FDTD: 


Finite-Difference Time-Domain 




Rotational Invariance Techniques 


FETD: 


Finite-Element Time-Domain 


FOCUSS: 


FOCal Under-determined System 


FPE: 


Final Prediction Error 




Solver 


GA: 


Genetic Algorithm 


HNQ: 


Hannan and Quinn method 


ICA: 


Independent Component Analysis 


IDE: 


Iterative Detection and Estimation 


IDT: 


Inverse Discrete Transform 


IMAT: 


Iterative Methods with Adaptive 


KLT: 


Karhunen Loeve Transform 




Thresholding 


h: 


Absolute Summable Discrete Signals 


h: 


Finite Energy Discrete Signals 


LDPC: 


Low Density Parity Check 


LP: 


Linear Programming 


MA: 


Moving Average 


MAP: 


Maximum A Posteriori 


MDL: 


Minimum Description Length 




probability 


MIMAT: 


Modified IMAT 


ML: 


Maximum Likelihood 


MMSE: 


Minimum Mean Squared Error 


MSL: 


Multi-Source Location 


MUSIC: 


Multiple Signal Classification 


NP: 


Non-Polynomial time 


OCT: 


Optical Coherence Tomography 


OFDM: 


Orthogonal Frequency Division 


OFDMA: 


Orthogonal Frequency Division 




Multiplex 




Multiple Access 


OMP: 


Orthogonal Matching Pursuit 


OSR: 


Over Samphng Ratio 


PCA: 


Principle Component Analysis 


PDF: 


Probability Density Function 


PHD: 


Pisarenko Harmonic Decomposition 


POCS: 


Projection Onto Convex Sets 


PPM: 


Pulse-Position Modulation 


RDE: 


Recursive Detection and Estimation 


RIP: 


Restricted Isometry Property 


RS: 


Reed-Solomon 


RV: 


Residual Variance 


SA: 


Simulated Annealing 


SCA: 


Sparse Component Analysis 


SDCT: 


Sorted DCT 


SOFT: 


Sorted DFT 


SDR: 


Sparse Dictionary Representation 


SER: 


Symbol Error Rate 


SI: 


Shift Invariant 


SLO: 


Smoothed £o-norm 


SNR: 


Signal-to-Noise Ratio 


ULA: 


Uniform Linear Array 


UWB: 


Ultra Wide Band 


WIMAX: 


Worldwide Inter-operabihty for 


WLAN: 


Wireless Local Area Network 




Microwave Access 


WMAN: 


Wireless Metropohtan Area 
Network 







